Compare commits

41 Commits

Author SHA1 Message Date
Eugene Tolbakov
964d26e415 fix: docker build for aarch64 (#1826) 2023-06-25 18:29:00 +09:00
Yingwen
fd412b7b07 refactor!: Uses table id to locate tables in table engines (#1817)
* refactor: add table_id to get_table()/table_exists()

* refactor: Add table_id to alter table request

* refactor: Add table id to DropTableRequest

* refactor: add table id to DropTableRequest

* refactor: Use table id as key for the tables map

* refactor: use table id as file engine's map key

* refactor: Remove table reference from engine's get_table/table_exists

* style: remove unused imports

* feat!: Add table id to TableRegionalValue

* style: fix clippy

* chore: add comments and logs
2023-06-25 15:05:20 +08:00
Weny Xu
223cf31409 feat: support to copy from orc format (#1814)
* feat: support to copy from orc format

* test: add copy from orc test

* chore: add license header

* refactor: remove unimplemented macro

* chore: apply suggestions from CR

* chore: bump orc-rust to 0.2.3
2023-06-25 14:07:16 +08:00
Ruihang Xia
62f660e439 feat: implement metrics for Scan plan (#1812)
* add metrics in some interfaces

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

* calc elapsed time and rows

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

---------

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
2023-06-25 14:06:50 +08:00
Lei, HUANG
0fb18245b8 fix: docker build (#1822) 2023-06-25 11:05:46 +08:00
Weny Xu
caed6879e6 refactor: remove redundant code (#1821) 2023-06-25 10:56:31 +08:00
Yingwen
5ab0747092 test(storage): wait task before checking scheduled task num (#1811) 2023-06-21 18:04:34 +08:00
Ruihang Xia
b1ccc7ef5d fix: prevent filter pushdown in distributed planner (#1806)
* fix: prevent filter pushdown in distributed planner

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

* fix metadata

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

---------

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
2023-06-21 16:25:50 +08:00
Lei, HUANG
d1b5ce0d35 chore: check catalog deregister result (#1810)
* chore: check deregister result and return error on failure

* refactor: SystemCatalog::deregister_table returns Result<()>
2023-06-21 08:09:11 +00:00
Lei, HUANG
a314993ab4 chore: change logstore default config (#1809) 2023-06-21 07:34:24 +00:00
LFC
fa522bc579 fix: drop region alive countdown tasks when deregistering table (#1808) 2023-06-21 14:49:32 +08:00
Lei, HUANG
5335203360 feat: support cross compilation to aarch64 linux (#1802) 2023-06-21 14:08:45 +08:00
Ruihang Xia
23bf55a265 fix: __field__ matcher on single value column (#1805)
* fix error text and field_column_names

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

* add sqlness test

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

* add empty line

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

* improve style

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

---------

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
2023-06-21 10:59:58 +08:00
Eugene Tolbakov
3b91fc2c64 feat: add initial implementation for status endpoint (#1789)
* feat: add initial implementation for status endpoint

* feat(status_endpoint): add more data to response

* feat(status_endpoint): use build data env vars

* feat(status_endpoint): add simple test

* fix(status_endpoint): adjust the toml indentation
2023-06-21 10:50:08 +08:00
LFC
6205616301 fix: filter table regional values with the current node id (#1800) 2023-06-20 19:17:35 +08:00
JeremyHi
e47ef1f0d2 chore: minor fix (#1801) 2023-06-20 11:03:52 +00:00
Lei, HUANG
16c1ee2618 feat: incremental database backup (#1240)
* feat: incremental database backup

* chore: rebase develop

* chore: move backup to StatementExecutor

* feat: copy database parser

* chore: remove some todos

* chore: use timestamp string instead of i64 string

* fix: typo
2023-06-20 18:26:55 +08:00
JeremyHi
323e2aed07 feat: deal with more than 128 txn (#1799) 2023-06-20 17:56:45 +08:00
LFC
cbc2620a59 feat: start region alive keepers (#1796)
* feat: start region alive keepers
2023-06-20 15:45:29 +08:00
JeremyHi
4fdee5ea3c feat: deal with node epoch (#1795)
* feat: deal with node epoch

* feat: dn send node_epoch

* Update src/meta-srv/src/handler/persist_stats_handler.rs

Co-authored-by: dennis zhuang <killme2008@gmail.com>

* Update src/meta-srv/src/service/store/ext.rs

Co-authored-by: dennis zhuang <killme2008@gmail.com>

* chore: by cr

---------

Co-authored-by: dennis zhuang <killme2008@gmail.com>
2023-06-20 07:07:05 +00:00
dennis zhuang
30472cebae feat: prepare supports caching logical plan and inferring param types (#1776)
* feat: change do_describe function signature

* feat: infer param type and cache logical plan for mysql prepared statements

* fix: convert_value

* fix: forgot helper

* chore: comments

* fix: typo

* test: add more tests and test date, datetime in mysql

* chore: fix CR comments

* chore: add location

* chore: by CR comments

* Update tests-integration/tests/sql.rs

Co-authored-by: Ruihang Xia <waynestxia@gmail.com>

* chore: remove the trace

---------

Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
2023-06-20 04:07:28 +00:00
Ruihang Xia
903f02bf10 ci: optimize release progress (#1794)
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
2023-06-20 11:39:53 +08:00
JeremyHi
1703e93e15 feat: add handler execution timer (#1791)
* feat: add handler execution timer

* fix: by cr
2023-06-20 11:25:13 +08:00
LFC
2dd86b686f feat: extend region leases in Metasrv (#1784)
* feat: extend region leases in Metasrv

* fix: resolve PR comments
2023-06-19 19:55:59 +08:00
LFC
128c6ec98c feat: region alive keeper in Datanode (#1780) 2023-06-19 14:50:33 +08:00
Lei, HUANG
960b84262b fix: abort parquet writer (#1785)
* fix: sst file size

* fix: avoid creating file when no row's been written

* chore: rename tests

* fix: some clippy issues

* fix: some cr comments
2023-06-19 03:19:31 +00:00
Lei, HUANG
69854c07c5 fix: wait for compaction task to finish (#1783) 2023-06-16 16:45:06 +08:00
JeremyHi
1eeb5b4330 feat: disable_region_failover option for metasrv (#1777) 2023-06-15 16:26:27 +08:00
LFC
9b3037fe97 feat: a countdown task for closing region in Datanode (#1775) 2023-06-14 15:50:21 +08:00
dennis zhuang
09747ea206 feat: use DataFrame to replace SQL for Prometheus remote read (#1774)
* feat: debug QueryEngineState

* feat: impl read_table to create DataFrame for a table

* fix: clippy warnings

* feat: use DataFrame to handle prometheus remote read queries

* Update src/frontend/src/instance/prometheus.rs

Co-authored-by: LFC <bayinamine@gmail.com>

* chore: CR comments

---------

Co-authored-by: LFC <bayinamine@gmail.com>
2023-06-14 07:39:28 +00:00
Lei, HUANG
fb35e09072 chore: fix compaction caused race condition (#1767)
fix: unit tests. For real, this time.
2023-06-13 21:03:09 +08:00
Weny Xu
803940cfa4 feat: enable azblob tests (#1765)
* feat: enable azblob tests

* fix: add missing arg
2023-06-13 07:44:57 +00:00
Weny Xu
420ae054b3 chore: add debug log for heartbeat (#1770) 2023-06-13 07:43:26 +00:00
Lei, HUANG
0f1e061f24 fix: compile issue on develop and workaround to fix failing tests cau… (#1771)
* fix: compile issue on develop and workaround to fix failing tests caused by logstore file lock

* Apply suggestions from code review

Co-authored-by: JeremyHi <jiachun_feng@proton.me>

---------

Co-authored-by: JeremyHi <jiachun_feng@proton.me>
2023-06-13 07:30:16 +00:00
Lei, HUANG
7961de25ad feat: persist compaction time window (#1757)
* feat: persist compaction time window

* refactor: remove useless compaction window fields

* chore: revert some useless change

* fix: some CR comments

* fix: comment out unstable sqlness test

* revert commented sqlness
2023-06-13 10:15:42 +08:00
Lei, HUANG
f7d98e533b chore: fix compaction caused race condition (#1759)
* fix: set max_files_in_l0 in unit tests to avoid compaction

* refactor: pass while EngineConfig

* fix: comment out unstable sqlness test

* revert commented sqlness
2023-06-12 11:19:42 +00:00
Ruihang Xia
b540d640cf fix: unstable order with union operation (#1763)
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
2023-06-12 18:16:24 +08:00
Eugene Tolbakov
51a4d660b7 feat(to_unixtime): add timestamp types as arguments (#1632)
* feat(to_unixtime): add timestamp types as arguments

* feat(to_unixtime): change the return type

* feat(to_unixtime): address code review issues

* feat(to_unixtime): fix fmt issue
2023-06-12 17:21:49 +08:00
Ruihang Xia
1b2381502e fix: bring EnforceSorting rule forward (#1754)
* fix: bring EnforceSorting rule forward

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

* remove duplicated rules

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

* wrap remove logic into a method

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

---------

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
2023-06-12 07:29:08 +00:00
Yingwen
0e937be3f5 fix(storage): Use region_write_buffer_size as default value (#1760) 2023-06-12 15:05:17 +08:00
Weny Xu
564c183607 chore: make MetaKvBackend public (#1761) 2023-06-12 14:13:26 +08:00
192 changed files with 5375 additions and 1419 deletions

View File

@@ -20,6 +20,3 @@ out/
# Rust
target/
# Git
.git

View File

@@ -35,6 +35,7 @@ jobs:
build-macos:
name: Build macOS binary
strategy:
fail-fast: false
matrix:
# The file format is greptime-<os>-<arch>
include:
@@ -129,6 +130,7 @@ jobs:
build-linux:
name: Build linux binary
strategy:
fail-fast: false
matrix:
# The file format is greptime-<os>-<arch>
include:

2
.gitignore vendored
View File

@@ -44,3 +44,5 @@ benchmarks/data
# Vscode workspace
*.code-workspace
venv/

137
Cargo.lock generated
View File

@@ -209,8 +209,8 @@ dependencies = [
"greptime-proto",
"prost",
"snafu",
"tonic 0.9.2",
"tonic-build 0.9.2",
"tonic",
"tonic-build",
]
[[package]]
@@ -382,7 +382,7 @@ dependencies = [
"paste",
"prost",
"tokio",
"tonic 0.9.2",
"tonic",
]
[[package]]
@@ -1538,7 +1538,7 @@ dependencies = [
"substrait 0.7.5",
"tokio",
"tokio-stream",
"tonic 0.9.2",
"tonic",
"tracing",
"tracing-subscriber",
]
@@ -1679,6 +1679,7 @@ dependencies = [
"derive_builder 0.12.0",
"futures",
"object-store",
"orc-rust",
"paste",
"regex",
"snafu",
@@ -1760,7 +1761,7 @@ dependencies = [
"rand",
"snafu",
"tokio",
"tonic 0.9.2",
"tonic",
"tower",
]
@@ -1801,6 +1802,7 @@ name = "common-meta"
version = "0.4.0"
dependencies = [
"api",
"async-trait",
"chrono",
"common-catalog",
"common-error",
@@ -2005,7 +2007,7 @@ checksum = "c2895653b4d9f1538a83970077cb01dfc77a4810524e51a110944688e916b18e"
dependencies = [
"prost",
"prost-types",
"tonic 0.9.2",
"tonic",
"tracing-core",
]
@@ -2027,7 +2029,7 @@ dependencies = [
"thread_local",
"tokio",
"tokio-stream",
"tonic 0.9.2",
"tonic",
"tracing",
"tracing-core",
"tracing-subscriber",
@@ -2647,7 +2649,7 @@ dependencies = [
"tokio",
"tokio-stream",
"toml",
"tonic 0.9.2",
"tonic",
"tower",
"tower-http",
"url",
@@ -3025,16 +3027,16 @@ dependencies = [
[[package]]
name = "etcd-client"
version = "0.10.4"
version = "0.11.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "4319dc0fb739a6e84cb8678b8cf50c9bcfa4712ae826b33ecf00cc0850550a58"
checksum = "f4b0ea5ef6dc2388a4b1669fa32097249bc03a15417b97cb75e38afb309e4a89"
dependencies = [
"http",
"prost",
"tokio",
"tokio-stream",
"tonic 0.8.3",
"tonic-build 0.8.4",
"tonic",
"tonic-build",
"tower",
"tower-service",
]
@@ -3068,6 +3070,12 @@ version = "0.2.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "4443176a9f2c162692bd3d352d745ef9413eec5782a80d8fd6f8a1ac692a07f7"
[[package]]
name = "fallible-streaming-iterator"
version = "0.1.9"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "7360491ce676a36bf9bb3c56c1aa791658183a54d2744120f27285738d90465a"
[[package]]
name = "fastrand"
version = "1.9.0"
@@ -3221,6 +3229,7 @@ dependencies = [
"common-runtime",
"common-telemetry",
"common-test-util",
"common-time",
"datafusion",
"datafusion-common",
"datafusion-expr",
@@ -3257,7 +3266,7 @@ dependencies = [
"table",
"tokio",
"toml",
"tonic 0.9.2",
"tonic",
"tower",
"uuid",
]
@@ -4096,13 +4105,13 @@ checksum = "d2fabcfbdc87f4758337ca535fb41a6d701b65693ce38287d856d1674551ec9b"
[[package]]
name = "greptime-proto"
version = "0.1.0"
source = "git+https://github.com/GreptimeTeam/greptime-proto.git?rev=4398d20c56d5f7939cc2960789cb1fa7dd18e6fe#4398d20c56d5f7939cc2960789cb1fa7dd18e6fe"
source = "git+https://github.com/GreptimeTeam/greptime-proto.git?rev=7aeaeaba1e0ca6a5c736b6ab2eb63144ae3d284b#7aeaeaba1e0ca6a5c736b6ab2eb63144ae3d284b"
dependencies = [
"prost",
"serde",
"serde_json",
"tonic 0.9.2",
"tonic-build 0.9.2",
"tonic",
"tonic-build",
]
[[package]]
@@ -5141,7 +5150,7 @@ dependencies = [
"table",
"tokio",
"tokio-stream",
"tonic 0.9.2",
"tonic",
"tower",
"tracing",
"tracing-subscriber",
@@ -5185,10 +5194,11 @@ dependencies = [
"serde_json",
"servers",
"snafu",
"store-api",
"table",
"tokio",
"tokio-stream",
"tonic 0.9.2",
"tonic",
"tower",
"tracing",
"tracing-subscriber",
@@ -5980,6 +5990,27 @@ version = "0.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "978aa494585d3ca4ad74929863093e87cac9790d81fe7aba2b3dc2890643a0fc"
[[package]]
name = "orc-rust"
version = "0.2.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e15d3f67795da54d9526e46b7808181ce6236d518f56ca1ee556d3a3fdd77c66"
dependencies = [
"arrow",
"bytes",
"chrono",
"fallible-streaming-iterator",
"flate2",
"futures",
"futures-util",
"lazy_static",
"paste",
"prost",
"snafu",
"tokio",
"zigzag",
]
[[package]]
name = "ordered-float"
version = "1.1.1"
@@ -8517,6 +8548,7 @@ dependencies = [
"axum-macros",
"axum-test-helper",
"base64 0.13.1",
"build-data",
"bytes",
"catalog",
"chrono",
@@ -8534,6 +8566,9 @@ dependencies = [
"common-telemetry",
"common-test-util",
"common-time",
"datafusion",
"datafusion-common",
"datafusion-expr",
"datatypes",
"derive_builder 0.12.0",
"digest",
@@ -8583,7 +8618,7 @@ dependencies = [
"tokio-rustls 0.24.0",
"tokio-stream",
"tokio-test",
"tonic 0.9.2",
"tonic",
"tonic-reflection",
"tower",
"tower-http",
@@ -8970,6 +9005,7 @@ dependencies = [
"bitflags 1.3.2",
"byteorder",
"bytes",
"chrono",
"crc",
"crossbeam-queue",
"digest",
@@ -9137,8 +9173,8 @@ dependencies = [
"table",
"tokio",
"tokio-util",
"tonic 0.9.2",
"tonic-build 0.9.2",
"tonic",
"tonic-build",
"uuid",
]
@@ -9550,6 +9586,7 @@ dependencies = [
"axum",
"axum-test-helper",
"catalog",
"chrono",
"client",
"common-base",
"common-catalog",
@@ -9595,7 +9632,7 @@ dependencies = [
"table",
"tempfile",
"tokio",
"tonic 0.9.2",
"tonic",
"tower",
"uuid",
]
@@ -9970,38 +10007,6 @@ dependencies = [
"winnow",
]
[[package]]
name = "tonic"
version = "0.8.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8f219fad3b929bef19b1f86fbc0358d35daed8f2cac972037ac0dc10bbb8d5fb"
dependencies = [
"async-stream",
"async-trait",
"axum",
"base64 0.13.1",
"bytes",
"futures-core",
"futures-util",
"h2",
"http",
"http-body",
"hyper",
"hyper-timeout",
"percent-encoding",
"pin-project",
"prost",
"prost-derive",
"tokio",
"tokio-stream",
"tokio-util",
"tower",
"tower-layer",
"tower-service",
"tracing",
"tracing-futures",
]
[[package]]
name = "tonic"
version = "0.9.2"
@@ -10033,19 +10038,6 @@ dependencies = [
"tracing",
]
[[package]]
name = "tonic-build"
version = "0.8.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5bf5e9b9c0f7e0a7c027dcfaba7b2c60816c7049171f679d99ee2ff65d0de8c4"
dependencies = [
"prettyplease 0.1.25",
"proc-macro2",
"prost-build",
"quote",
"syn 1.0.109",
]
[[package]]
name = "tonic-build"
version = "0.9.2"
@@ -10069,7 +10061,7 @@ dependencies = [
"prost-types",
"tokio",
"tokio-stream",
"tonic 0.9.2",
"tonic",
]
[[package]]
@@ -11250,6 +11242,15 @@ version = "1.6.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "2a0956f1ba7c7909bfb66c2e9e4124ab6f6482560f6628b5aaeba39207c9aad9"
[[package]]
name = "zigzag"
version = "0.1.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "70b40401a28d86ce16a330b863b86fd7dbee4d7c940587ab09ab8c019f9e3fdf"
dependencies = [
"num-traits",
]
[[package]]
name = "zstd"
version = "0.11.2+zstd.1.5.2"

View File

@@ -72,7 +72,7 @@ datafusion-sql = { git = "https://github.com/waynexia/arrow-datafusion.git", rev
datafusion-substrait = { git = "https://github.com/waynexia/arrow-datafusion.git", rev = "63e52dde9e44cac4b1f6c6e6b6bf6368ba3bd323" }
futures = "0.3"
futures-util = "0.3"
greptime-proto = { git = "https://github.com/GreptimeTeam/greptime-proto.git", rev = "4398d20c56d5f7939cc2960789cb1fa7dd18e6fe" }
greptime-proto = { git = "https://github.com/GreptimeTeam/greptime-proto.git", rev = "7aeaeaba1e0ca6a5c736b6ab2eb63144ae3d284b" }
itertools = "0.10"
parquet = "40.0"
paste = "1.0"

7
Cross.toml Normal file
View File

@@ -0,0 +1,7 @@
[build]
pre-build = [
"dpkg --add-architecture $CROSS_DEB_ARCH",
"apt update && apt install -y unzip zlib1g-dev:$CROSS_DEB_ARCH",
"curl -LO https://github.com/protocolbuffers/protobuf/releases/download/v3.15.8/protoc-3.15.8-linux-x86_64.zip && unzip protoc-3.15.8-linux-x86_64.zip -d /usr/",
"chmod a+x /usr/bin/protoc && chmod -R a+rx /usr/include/google",
]

View File

@@ -34,7 +34,7 @@ docker-image: ## Build docker image.
##@ Test
test: nextest ## Run unit and integration tests.
cargo nextest run
cargo nextest run --retries 3
.PHONY: nextest ## Install nextest tools.
nextest:

View File

@@ -26,8 +26,8 @@ tcp_nodelay = true
[wal]
# WAL data directory
# dir = "/tmp/greptimedb/wal"
file_size = "1GB"
purge_threshold = "50GB"
file_size = "256MB"
purge_threshold = "4GB"
purge_interval = "10m"
read_batch_size = 128
sync_write = false

View File

@@ -81,9 +81,9 @@ addr = "127.0.0.1:4004"
# WAL data directory
# dir = "/tmp/greptimedb/wal"
# WAL file size in bytes.
file_size = "1GB"
# WAL purge threshold in bytes.
purge_threshold = "50GB"
file_size = "256MB"
# WAL purge threshold.
purge_threshold = "4GB"
# WAL purge interval in seconds.
purge_interval = "10m"
# WAL read batch size.

View File

@@ -8,6 +8,7 @@ RUN apt-get update && apt-get install -y \
libssl-dev \
protobuf-compiler \
curl \
git \
build-essential \
pkg-config \
python3 \

View File

@@ -8,6 +8,7 @@ RUN apt-get update && apt-get install -y \
libssl-dev \
protobuf-compiler \
curl \
git \
build-essential \
pkg-config \
wget

View File

@@ -91,7 +91,7 @@ pub fn build_table_regional_prefix(
}
/// Table global info has only one key across all datanodes so it does not have `node_id` field.
#[derive(Clone)]
#[derive(Clone, Hash, Eq, PartialEq)]
pub struct TableGlobalKey {
pub catalog_name: String,
pub schema_name: String,
@@ -124,6 +124,14 @@ impl TableGlobalKey {
table_name: captures[3].to_string(),
})
}
pub fn to_raw_key(&self) -> Vec<u8> {
self.to_string().into_bytes()
}
pub fn try_from_raw_key(key: &[u8]) -> Result<Self, Error> {
Self::parse(String::from_utf8_lossy(key))
}
}
/// Table global info contains necessary info for a datanode to create table regions, including
@@ -141,6 +149,10 @@ impl TableGlobalValue {
pub fn table_id(&self) -> TableId {
self.table_info.ident.table_id
}
pub fn engine(&self) -> &str {
&self.table_info.meta.engine
}
}
/// Table regional info that varies between datanode, so it contains a `node_id` field.
@@ -189,6 +201,9 @@ impl TableRegionalKey {
/// region ids allocated by metasrv.
#[derive(Debug, Serialize, Deserialize, Clone)]
pub struct TableRegionalValue {
// We can remove the `Option` from the table id once all regional values
// stored in meta have table ids.
pub table_id: Option<TableId>,
pub version: TableVersion,
pub regions_ids: Vec<u32>,
pub engine_name: Option<String>,

View File

@@ -467,10 +467,7 @@ impl CatalogManager for LocalCatalogManager {
.ident
.table_id;
if !self.system.deregister_table(&request, table_id).await? {
return Ok(false);
}
self.system.deregister_table(&request, table_id).await?;
self.catalogs.deregister_table(request).await
}
}

View File

@@ -17,7 +17,7 @@ use std::fmt::Debug;
use std::pin::Pin;
use std::sync::Arc;
pub use client::CachedMetaKvBackend;
pub use client::{CachedMetaKvBackend, MetaKvBackend};
use futures::Stream;
use futures_util::StreamExt;
pub use manager::{RemoteCatalogManager, RemoteCatalogProvider, RemoteSchemaProvider};
@@ -29,6 +29,7 @@ mod manager;
#[cfg(feature = "testing")]
pub mod mock;
pub mod region_alive_keeper;
#[derive(Debug, Clone)]
pub struct Kv(pub Vec<u8>, pub Vec<u8>);

View File

@@ -20,13 +20,14 @@ use std::sync::Arc;
use async_stream::stream;
use async_trait::async_trait;
use common_catalog::consts::{MAX_SYS_TABLE_ID, MITO_ENGINE};
use common_meta::ident::TableIdent;
use common_telemetry::{debug, error, info, warn};
use dashmap::DashMap;
use futures::Stream;
use futures_util::{StreamExt, TryStreamExt};
use metrics::{decrement_gauge, increment_gauge};
use parking_lot::RwLock;
use snafu::{OptionExt, ResultExt};
use snafu::{ensure, OptionExt, ResultExt};
use table::engine::manager::TableEngineManagerRef;
use table::engine::{EngineContext, TableReference};
use table::requests::{CreateTableRequest, OpenTableRequest};
@@ -43,6 +44,7 @@ use crate::helper::{
build_table_regional_prefix, CatalogKey, CatalogValue, SchemaKey, SchemaValue, TableGlobalKey,
TableGlobalValue, TableRegionalKey, TableRegionalValue, CATALOG_KEY_PREFIX,
};
use crate::remote::region_alive_keeper::RegionAliveKeepers;
use crate::remote::{Kv, KvBackendRef};
use crate::{
handle_system_table_request, CatalogManager, CatalogProvider, CatalogProviderRef,
@@ -57,16 +59,23 @@ pub struct RemoteCatalogManager {
catalogs: Arc<RwLock<DashMap<String, CatalogProviderRef>>>,
engine_manager: TableEngineManagerRef,
system_table_requests: Mutex<Vec<RegisterSystemTableRequest>>,
region_alive_keepers: Arc<RegionAliveKeepers>,
}
impl RemoteCatalogManager {
pub fn new(engine_manager: TableEngineManagerRef, node_id: u64, backend: KvBackendRef) -> Self {
pub fn new(
engine_manager: TableEngineManagerRef,
node_id: u64,
backend: KvBackendRef,
region_alive_keepers: Arc<RegionAliveKeepers>,
) -> Self {
Self {
engine_manager,
node_id,
backend,
catalogs: Default::default(),
system_table_requests: Default::default(),
region_alive_keepers,
}
}
@@ -76,6 +85,7 @@ impl RemoteCatalogManager {
catalog_name: catalog_name.to_string(),
backend: self.backend.clone(),
engine_manager: self.engine_manager.clone(),
region_alive_keepers: self.region_alive_keepers.clone(),
}) as _
}
@@ -123,10 +133,17 @@ impl RemoteCatalogManager {
increment_gauge!(crate::metrics::METRIC_CATALOG_MANAGER_CATALOG_COUNT, 1.0);
let region_alive_keepers = self.region_alive_keepers.clone();
joins.push(common_runtime::spawn_bg(async move {
let max_table_id =
initiate_schemas(node_id, backend, engine_manager, &catalog_name, catalog)
.await?;
let max_table_id = initiate_schemas(
node_id,
backend,
engine_manager,
&catalog_name,
catalog,
region_alive_keepers,
)
.await?;
info!(
"Catalog name: {}, max table id allocated: {}",
&catalog_name, max_table_id
@@ -155,6 +172,7 @@ impl RemoteCatalogManager {
self.engine_manager.clone(),
catalog_name,
schema_name,
self.region_alive_keepers.clone(),
);
let catalog_provider = self.new_catalog_provider(catalog_name);
@@ -200,6 +218,7 @@ fn new_schema_provider(
engine_manager: TableEngineManagerRef,
catalog_name: &str,
schema_name: &str,
region_alive_keepers: Arc<RegionAliveKeepers>,
) -> SchemaProviderRef {
Arc::new(RemoteSchemaProvider {
catalog_name: catalog_name.to_string(),
@@ -207,6 +226,7 @@ fn new_schema_provider(
node_id,
backend,
engine_manager,
region_alive_keepers,
}) as _
}
@@ -240,6 +260,7 @@ async fn initiate_schemas(
engine_manager: TableEngineManagerRef,
catalog_name: &str,
catalog: CatalogProviderRef,
region_alive_keepers: Arc<RegionAliveKeepers>,
) -> Result<u32> {
let mut schemas = iter_remote_schemas(&backend, catalog_name).await;
let mut joins = Vec::new();
@@ -259,6 +280,7 @@ async fn initiate_schemas(
engine_manager.clone(),
&catalog_name,
&schema_name,
region_alive_keepers.clone(),
);
catalog
.register_schema(schema_name.clone(), schema.clone())
@@ -576,34 +598,33 @@ impl CatalogManager for RemoteCatalogManager {
}
async fn register_table(&self, request: RegisterTableRequest) -> Result<bool> {
let catalog_name = request.catalog;
let schema_name = request.schema;
let catalog = &request.catalog;
let schema = &request.schema;
let table_name = &request.table_name;
let schema_provider = self
.catalog(&catalog_name)
.catalog(catalog)
.await?
.context(CatalogNotFoundSnafu {
catalog_name: &catalog_name,
catalog_name: catalog,
})?
.schema(&schema_name)
.schema(schema)
.await?
.with_context(|| SchemaNotFoundSnafu {
catalog: &catalog_name,
schema: &schema_name,
})?;
if schema_provider.table_exist(&request.table_name).await? {
return TableExistsSnafu {
table: format!("{}.{}.{}", &catalog_name, &schema_name, &request.table_name),
.context(SchemaNotFoundSnafu { catalog, schema })?;
ensure!(
!schema_provider.table_exist(table_name).await?,
TableExistsSnafu {
table: common_catalog::format_full_table_name(catalog, schema, table_name),
}
.fail();
}
);
increment_gauge!(
crate::metrics::METRIC_CATALOG_MANAGER_TABLE_COUNT,
1.0,
&[crate::metrics::db_label(&catalog_name, &schema_name)],
&[crate::metrics::db_label(catalog, schema)],
);
schema_provider
.register_table(request.table_name, request.table)
.register_table(table_name.to_string(), request.table)
.await?;
Ok(true)
@@ -626,7 +647,22 @@ impl CatalogManager for RemoteCatalogManager {
1.0,
&[crate::metrics::db_label(catalog_name, schema_name)],
);
Ok(result.is_none())
if let Some(table) = result.as_ref() {
let table_info = table.table_info();
let table_ident = TableIdent {
catalog: request.catalog,
schema: request.schema,
table: request.table_name,
table_id: table_info.ident.table_id,
engine: table_info.meta.engine.clone(),
};
self.region_alive_keepers
.deregister_table(&table_ident)
.await;
}
Ok(true)
}
async fn register_schema(&self, request: RegisterSchemaRequest) -> Result<bool> {
@@ -644,6 +680,7 @@ impl CatalogManager for RemoteCatalogManager {
self.engine_manager.clone(),
&catalog_name,
&schema_name,
self.region_alive_keepers.clone(),
);
catalog_provider
.register_schema(schema_name, schema_provider)
@@ -779,6 +816,7 @@ pub struct RemoteCatalogProvider {
catalog_name: String,
backend: KvBackendRef,
engine_manager: TableEngineManagerRef,
region_alive_keepers: Arc<RegionAliveKeepers>,
}
impl RemoteCatalogProvider {
@@ -787,12 +825,14 @@ impl RemoteCatalogProvider {
backend: KvBackendRef,
engine_manager: TableEngineManagerRef,
node_id: u64,
region_alive_keepers: Arc<RegionAliveKeepers>,
) -> Self {
Self {
node_id,
catalog_name,
backend,
engine_manager,
region_alive_keepers,
}
}
@@ -810,6 +850,7 @@ impl RemoteCatalogProvider {
node_id: self.node_id,
backend: self.backend.clone(),
engine_manager: self.engine_manager.clone(),
region_alive_keepers: self.region_alive_keepers.clone(),
};
Arc::new(provider) as Arc<_>
}
@@ -872,6 +913,7 @@ pub struct RemoteSchemaProvider {
node_id: u64,
backend: KvBackendRef,
engine_manager: TableEngineManagerRef,
region_alive_keepers: Arc<RegionAliveKeepers>,
}
impl RemoteSchemaProvider {
@@ -881,6 +923,7 @@ impl RemoteSchemaProvider {
node_id: u64,
engine_manager: TableEngineManagerRef,
backend: KvBackendRef,
region_alive_keepers: Arc<RegionAliveKeepers>,
) -> Self {
Self {
catalog_name,
@@ -888,6 +931,7 @@ impl RemoteSchemaProvider {
node_id,
backend,
engine_manager,
region_alive_keepers,
}
}
@@ -910,15 +954,26 @@ impl SchemaProvider for RemoteSchemaProvider {
async fn table_names(&self) -> Result<Vec<String>> {
let key_prefix = build_table_regional_prefix(&self.catalog_name, &self.schema_name);
let iter = self.backend.range(key_prefix.as_bytes());
let table_names = iter
let regional_keys = iter
.map(|kv| {
let Kv(key, _) = kv?;
let regional_key = TableRegionalKey::parse(String::from_utf8_lossy(&key))
.context(InvalidCatalogValueSnafu)?;
Ok(regional_key.table_name)
Ok(regional_key)
})
.try_collect()
.try_collect::<Vec<_>>()
.await?;
let table_names = regional_keys
.into_iter()
.filter_map(|x| {
if x.node_id == self.node_id {
Some(x.table_name)
} else {
None
}
})
.collect();
Ok(table_names)
}
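
The `table_names` change above first collects every regional key under the schema prefix and only then keeps the ones owned by the current node. A minimal sketch of that second step follows; the two-field `TableRegionalKey` and the free function `table_names_on_node` are simplified, hypothetical stand-ins rather than the types in the crate.

```rust
/// Simplified stand-in for the regional key; only the fields used here.
#[derive(Debug, Clone, PartialEq)]
pub struct TableRegionalKey {
    pub table_name: String,
    pub node_id: u64,
}

/// Keeps only the names of tables whose regional key belongs to `node_id`.
/// `filter` followed by `map` behaves the same as the `filter_map` with an
/// `if`/`else` shown in the diff.
pub fn table_names_on_node(keys: Vec<TableRegionalKey>, node_id: u64) -> Vec<String> {
    keys.into_iter()
        .filter(|key| key.node_id == node_id)
        .map(|key| key.table_name)
        .collect()
}
```

Without the node filter, a datanode would list tables whose regions live on other datanodes, since the range scan is keyed by catalog and schema only.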
@@ -929,21 +984,29 @@ impl SchemaProvider for RemoteSchemaProvider {
.get(key.as_bytes())
.await?
.map(|Kv(_, v)| {
let TableRegionalValue { engine_name, .. } =
TableRegionalValue::parse(String::from_utf8_lossy(&v))
.context(InvalidCatalogValueSnafu)?;
let reference = TableReference {
catalog: &self.catalog_name,
schema: &self.schema_name,
table: name,
let TableRegionalValue {
table_id,
engine_name,
..
} = TableRegionalValue::parse(String::from_utf8_lossy(&v))
.context(InvalidCatalogValueSnafu)?;
let Some(table_id) = table_id else {
warn!("Cannot find table id for {key}, the value has an old format");
return Ok(None);
};
let engine_name = engine_name.as_deref().unwrap_or(MITO_ENGINE);
let engine = self
.engine_manager
.engine(engine_name)
.context(TableEngineNotFoundSnafu { engine_name })?;
let reference = TableReference {
catalog: &self.catalog_name,
schema: &self.schema_name,
table: name,
};
let table = engine
.get_table(&EngineContext {}, &reference)
.get_table(&EngineContext {}, table_id)
.with_context(|_| OpenTableSnafu {
table_info: reference.to_string(),
})?;
@@ -956,9 +1019,12 @@ impl SchemaProvider for RemoteSchemaProvider {
}
async fn register_table(&self, name: String, table: TableRef) -> Result<Option<TableRef>> {
// Currently, initiate_tables() always call this method to register the table to the schema thus we
// always update the region value.
let table_info = table.table_info();
let table_version = table_info.ident.version;
let table_value = TableRegionalValue {
table_id: Some(table_info.ident.table_id),
version: table_version,
regions_ids: table.table_info().meta.region_numbers.clone(),
engine_name: Some(table_info.meta.engine.clone()),
@@ -970,6 +1036,18 @@ impl SchemaProvider for RemoteSchemaProvider {
&table_value.as_bytes().context(InvalidCatalogValueSnafu)?,
)
.await?;
let table_ident = TableIdent {
catalog: table_info.catalog_name.clone(),
schema: table_info.schema_name.clone(),
table: table_info.name.clone(),
table_id: table_info.ident.table_id,
engine: table_info.meta.engine.clone(),
};
self.region_alive_keepers
.register_table(table_ident, table)
.await?;
debug!(
"Successfully set catalog table entry, key: {}, table value: {:?}",
table_key, table_value
@@ -994,25 +1072,27 @@ impl SchemaProvider for RemoteSchemaProvider {
.get(table_key.as_bytes())
.await?
.map(|Kv(_, v)| {
let TableRegionalValue { engine_name, .. } =
TableRegionalValue::parse(String::from_utf8_lossy(&v))
.context(InvalidCatalogValueSnafu)?;
Ok(engine_name)
let TableRegionalValue {
table_id,
engine_name,
..
} = TableRegionalValue::parse(String::from_utf8_lossy(&v))
.context(InvalidCatalogValueSnafu)?;
Ok(engine_name.and_then(|name| table_id.map(|id| (name, id))))
})
.transpose()?
.flatten();
let engine_name = engine_opt.as_deref().unwrap_or_else(|| {
warn!("Cannot find table engine name for {table_key}");
MITO_ENGINE
});
self.backend.delete(table_key.as_bytes()).await?;
debug!(
"Successfully deleted catalog table entry, key: {}",
table_key
);
let Some((engine_name, table_id)) = engine_opt else {
warn!("Cannot find table id and engine name for {table_key}");
return Ok(None);
};
let reference = TableReference {
catalog: &self.catalog_name,
schema: &self.schema_name,
@@ -1021,9 +1101,9 @@ impl SchemaProvider for RemoteSchemaProvider {
// deregistering table does not necessarily mean dropping the table
let table = self
.engine_manager
.engine(engine_name)
.engine(&engine_name)
.context(TableEngineNotFoundSnafu { engine_name })?
.get_table(&EngineContext {}, &reference)
.get_table(&EngineContext {}, table_id)
.with_context(|_| OpenTableSnafu {
table_info: reference.to_string(),
})?;
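
In the deregistration path above, the engine name and the table id are both optional, and the code proceeds only when both are present. The sketch below isolates that pairing with std types only; the function names are hypothetical, and `TableId` is assumed to be a `u32` alias for the purposes of this example.

```rust
/// Assumed alias for this sketch.
pub type TableId = u32;

/// Mirrors the expression in the diff: yields a pair only if both the
/// engine name and the table id are present.
pub fn engine_and_id(
    engine_name: Option<String>,
    table_id: Option<TableId>,
) -> Option<(String, TableId)> {
    engine_name.and_then(|name| table_id.map(|id| (name, id)))
}

/// Equivalent formulation using `Option::zip` from the standard library.
pub fn engine_and_id_zip(
    engine_name: Option<String>,
    table_id: Option<TableId>,
) -> Option<(String, TableId)> {
    engine_name.zip(table_id)
}
```

A value written in the old format (no table id) therefore takes the `else` branch of the `let ... else` in the diff: a warning is logged and `Ok(None)` is returned instead of falling back to a default engine.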

View File

@@ -16,8 +16,7 @@ use std::any::Any;
use std::collections::btree_map::Entry;
use std::collections::{BTreeMap, HashMap};
use std::fmt::{Display, Formatter};
use std::str::FromStr;
use std::sync::Arc;
use std::sync::{Arc, RwLock as StdRwLock};
use async_stream::stream;
use common_catalog::consts::{DEFAULT_CATALOG_NAME, DEFAULT_SCHEMA_NAME};
@@ -27,9 +26,11 @@ use datatypes::data_type::ConcreteDataType;
use datatypes::schema::{ColumnSchema, Schema};
use datatypes::vectors::StringVector;
use serde::Serializer;
use table::engine::{EngineContext, TableEngine, TableReference};
use table::engine::{CloseTableResult, EngineContext, TableEngine};
use table::metadata::TableId;
use table::requests::{AlterTableRequest, CreateTableRequest, DropTableRequest, OpenTableRequest};
use table::requests::{
AlterTableRequest, CloseTableRequest, CreateTableRequest, DropTableRequest, OpenTableRequest,
};
use table::test_util::MemTable;
use table::TableRef;
use tokio::sync::RwLock;
@@ -165,7 +166,7 @@ impl KvBackend for MockKvBackend {
#[derive(Default)]
pub struct MockTableEngine {
tables: RwLock<HashMap<String, TableRef>>,
tables: StdRwLock<HashMap<TableId, TableRef>>,
}
#[async_trait::async_trait]
@@ -180,19 +181,8 @@ impl TableEngine for MockTableEngine {
_ctx: &EngineContext,
request: CreateTableRequest,
) -> table::Result<TableRef> {
let table_name = request.table_name.clone();
let catalog_name = request.catalog_name.clone();
let schema_name = request.schema_name.clone();
let table_id = request.id;
let default_table_id = "0".to_owned();
let table_id = TableId::from_str(
request
.table_options
.extra_options
.get("table_id")
.unwrap_or(&default_table_id),
)
.unwrap();
let schema = Arc::new(Schema::new(vec![ColumnSchema::new(
"name",
ConcreteDataType::string_datatype(),
@@ -202,16 +192,16 @@ impl TableEngine for MockTableEngine {
let data = vec![Arc::new(StringVector::from(vec!["a", "b", "c"])) as _];
let record_batch = RecordBatch::new(schema, data).unwrap();
let table: TableRef = Arc::new(MemTable::new_with_catalog(
&table_name,
&request.table_name,
record_batch,
table_id,
catalog_name,
schema_name,
request.catalog_name,
request.schema_name,
vec![0],
)) as Arc<_>;
let mut tables = self.tables.write().await;
tables.insert(table_name, table.clone() as TableRef);
let mut tables = self.tables.write().unwrap();
tables.insert(table_id, table.clone() as TableRef);
Ok(table)
}
@@ -220,7 +210,7 @@ impl TableEngine for MockTableEngine {
_ctx: &EngineContext,
request: OpenTableRequest,
) -> table::Result<Option<TableRef>> {
Ok(self.tables.read().await.get(&request.table_name).cloned())
Ok(self.tables.read().unwrap().get(&request.table_id).cloned())
}
async fn alter_table(
@@ -234,25 +224,13 @@ impl TableEngine for MockTableEngine {
fn get_table(
&self,
_ctx: &EngineContext,
table_ref: &TableReference,
table_id: TableId,
) -> table::Result<Option<TableRef>> {
futures::executor::block_on(async {
Ok(self
.tables
.read()
.await
.get(&table_ref.to_string())
.cloned())
})
Ok(self.tables.read().unwrap().get(&table_id).cloned())
}
fn table_exists(&self, _ctx: &EngineContext, table_ref: &TableReference) -> bool {
futures::executor::block_on(async {
self.tables
.read()
.await
.contains_key(&table_ref.to_string())
})
fn table_exists(&self, _ctx: &EngineContext, table_id: TableId) -> bool {
self.tables.read().unwrap().contains_key(&table_id)
}
async fn drop_table(
@@ -263,6 +241,15 @@ impl TableEngine for MockTableEngine {
unimplemented!()
}
async fn close_table(
&self,
_ctx: &EngineContext,
request: CloseTableRequest,
) -> table::Result<CloseTableResult> {
let _ = self.tables.write().unwrap().remove(&request.table_id);
Ok(CloseTableResult::Released(vec![]))
}
async fn close(&self) -> table::Result<()> {
Ok(())
}
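
The mock engine above now keys its table map by `TableId` behind a standard-library `RwLock`, so the synchronous `get_table` and `table_exists` no longer need `futures::executor::block_on`. The sketch below shows the same idea in isolation. `Registry`, its method names, and the use of plain strings in place of real tables are assumptions made for illustration only.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Assumed to be a plain integer id for this sketch.
pub type TableId = u32;

/// Hypothetical registry: tables are represented by their names only.
#[derive(Default)]
pub struct Registry {
    tables: RwLock<HashMap<TableId, Arc<String>>>,
}

impl Registry {
    /// Inserts (or replaces) the table stored under `id`.
    pub fn insert(&self, id: TableId, name: &str) -> Arc<String> {
        let table = Arc::new(name.to_string());
        self.tables.write().unwrap().insert(id, table.clone());
        table
    }

    /// Synchronous lookup: the std lock needs no async runtime.
    pub fn get(&self, id: TableId) -> Option<Arc<String>> {
        self.tables.read().unwrap().get(&id).cloned()
    }

    /// Synchronous existence check by id.
    pub fn exists(&self, id: TableId) -> bool {
        self.tables.read().unwrap().contains_key(&id)
    }

    /// Removes the table; returns whether something was removed.
    pub fn close(&self, id: TableId) -> bool {
        self.tables.write().unwrap().remove(&id).is_some()
    }
}
```

Each lock guard here is dropped at the end of its statement, so none is held across an `.await` point if these methods are called from async code.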


@@ -0,0 +1,822 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
use std::collections::HashMap;
use std::future::Future;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use async_trait::async_trait;
use common_meta::error::InvalidProtoMsgSnafu;
use common_meta::heartbeat::handler::{
HandleControl, HeartbeatResponseHandler, HeartbeatResponseHandlerContext,
};
use common_meta::ident::TableIdent;
use common_meta::RegionIdent;
use common_telemetry::{debug, error, info, warn};
use snafu::{OptionExt, ResultExt};
use store_api::storage::RegionNumber;
use table::engine::manager::TableEngineManagerRef;
use table::engine::{CloseTableResult, EngineContext, TableEngineRef};
use table::requests::CloseTableRequest;
use table::TableRef;
use tokio::sync::{mpsc, oneshot, Mutex};
use tokio::task::JoinHandle;
use tokio::time::{Duration, Instant};
use crate::error::{Result, TableEngineNotFoundSnafu};
/// [RegionAliveKeepers] manages all [RegionAliveKeeper] in a scope of tables.
pub struct RegionAliveKeepers {
table_engine_manager: TableEngineManagerRef,
keepers: Arc<Mutex<HashMap<TableIdent, Arc<RegionAliveKeeper>>>>,
heartbeat_interval_millis: u64,
started: AtomicBool,
/// The epoch when [RegionAliveKeepers] is created. It's used to get a monotonically non-decreasing
/// elapsed time when submitting heartbeats to Metasrv (because [Instant] is monotonically
/// non-decreasing). The heartbeat request will carry the duration since this epoch, and the
/// duration acts like an "invariant point" for region's keep alive lease.
epoch: Instant,
}
impl RegionAliveKeepers {
pub fn new(
table_engine_manager: TableEngineManagerRef,
heartbeat_interval_millis: u64,
) -> Self {
Self {
table_engine_manager,
keepers: Arc::new(Mutex::new(HashMap::new())),
heartbeat_interval_millis,
started: AtomicBool::new(false),
epoch: Instant::now(),
}
}
pub async fn find_keeper(&self, table_ident: &TableIdent) -> Option<Arc<RegionAliveKeeper>> {
self.keepers.lock().await.get(table_ident).cloned()
}
pub async fn register_table(&self, table_ident: TableIdent, table: TableRef) -> Result<()> {
let keeper = self.find_keeper(&table_ident).await;
if keeper.is_some() {
return Ok(());
}
let table_engine = self
.table_engine_manager
.engine(&table_ident.engine)
.context(TableEngineNotFoundSnafu {
engine_name: &table_ident.engine,
})?;
let keeper = Arc::new(RegionAliveKeeper::new(
table_engine,
table_ident.clone(),
self.heartbeat_interval_millis,
));
for r in table.table_info().meta.region_numbers.iter() {
keeper.register_region(*r).await;
}
let mut keepers = self.keepers.lock().await;
keepers.insert(table_ident.clone(), keeper.clone());
if self.started.load(Ordering::Relaxed) {
keeper.start().await;
info!("RegionAliveKeeper for table {table_ident} is started!");
} else {
info!("RegionAliveKeeper for table {table_ident} is registered but not started yet!");
}
Ok(())
}
pub async fn deregister_table(
&self,
table_ident: &TableIdent,
) -> Option<Arc<RegionAliveKeeper>> {
self.keepers.lock().await.remove(table_ident).map(|x| {
info!("Deregister RegionAliveKeeper for table {table_ident}");
x
})
}
pub async fn register_region(&self, region_ident: &RegionIdent) {
let table_ident = &region_ident.table_ident;
let Some(keeper) = self.find_keeper(table_ident).await else {
// Alive keeper could be affected by lagging msg, just warn and ignore.
warn!("Alive keeper for region {region_ident} is not found!");
return;
};
keeper.register_region(region_ident.region_number).await
}
pub async fn deregister_region(&self, region_ident: &RegionIdent) {
let table_ident = &region_ident.table_ident;
let Some(keeper) = self.find_keeper(table_ident).await else {
// Alive keeper could be affected by lagging msg, just warn and ignore.
warn!("Alive keeper for region {region_ident} is not found!");
return;
};
let _ = keeper.deregister_region(region_ident.region_number).await;
}
pub async fn start(&self) {
let keepers = self.keepers.lock().await;
for keeper in keepers.values() {
keeper.start().await;
}
self.started.store(true, Ordering::Relaxed);
info!(
"RegionAliveKeepers for tables {:?} are started!",
keepers.keys().map(|x| x.to_string()).collect::<Vec<_>>(),
);
}
pub fn epoch(&self) -> Instant {
self.epoch
}
}
#[async_trait]
impl HeartbeatResponseHandler for RegionAliveKeepers {
fn is_acceptable(&self, ctx: &HeartbeatResponseHandlerContext) -> bool {
!ctx.response.region_leases.is_empty()
}
async fn handle(
&self,
ctx: &mut HeartbeatResponseHandlerContext,
) -> common_meta::error::Result<HandleControl> {
let leases = ctx.response.region_leases.drain(..).collect::<Vec<_>>();
for lease in leases {
let table_ident: TableIdent = match lease
.table_ident
.context(InvalidProtoMsgSnafu {
err_msg: "'table_ident' is missing in RegionLease",
})
.and_then(|x| x.try_into())
{
Ok(x) => x,
Err(e) => {
error!(e; "");
continue;
}
};
let Some(keeper) = self.keepers.lock().await.get(&table_ident).cloned() else {
// Alive keeper could be affected by lagging msg, just warn and ignore.
warn!("Alive keeper for table {table_ident} is not found!");
continue;
};
let start_instant = self.epoch + Duration::from_millis(lease.duration_since_epoch);
let deadline = start_instant + Duration::from_secs(lease.lease_seconds);
keeper.keep_lived(lease.regions, deadline).await;
}
Ok(HandleControl::Continue)
}
}
/// [RegionAliveKeeper] starts a countdown for each region in a table. When deadline is reached,
/// the region will be closed.
/// The deadline is controlled by Metasrv. It works like a "lease" for regions: a Datanode submits its
/// opened regions to Metasrv in heartbeats. If Metasrv decides that a region may reside on this
/// Datanode, it "extends" the region's "lease" with a deadline for [RegionAliveKeeper] to
/// count down to.
pub struct RegionAliveKeeper {
table_engine: TableEngineRef,
table_ident: TableIdent,
countdown_task_handles: Arc<Mutex<HashMap<RegionNumber, Arc<CountdownTaskHandle>>>>,
heartbeat_interval_millis: u64,
started: AtomicBool,
}
impl RegionAliveKeeper {
fn new(
table_engine: TableEngineRef,
table_ident: TableIdent,
heartbeat_interval_millis: u64,
) -> Self {
Self {
table_engine,
table_ident,
countdown_task_handles: Arc::new(Mutex::new(HashMap::new())),
heartbeat_interval_millis,
started: AtomicBool::new(false),
}
}
async fn find_handle(&self, region: &RegionNumber) -> Option<Arc<CountdownTaskHandle>> {
self.countdown_task_handles
.lock()
.await
.get(region)
.cloned()
}
async fn register_region(&self, region: RegionNumber) {
if self.find_handle(&region).await.is_some() {
return;
}
let countdown_task_handles = Arc::downgrade(&self.countdown_task_handles);
let on_task_finished = async move {
if let Some(x) = countdown_task_handles.upgrade() {
x.lock().await.remove(&region);
} // Else the countdown task handles map could be dropped because the keeper is dropped.
};
let handle = Arc::new(CountdownTaskHandle::new(
self.table_engine.clone(),
self.table_ident.clone(),
region,
|| on_task_finished,
));
let mut handles = self.countdown_task_handles.lock().await;
handles.insert(region, handle.clone());
if self.started.load(Ordering::Relaxed) {
handle.start(self.heartbeat_interval_millis).await;
info!(
"Region alive countdown for region {region} in table {} is started!",
self.table_ident
);
} else {
info!(
"Region alive countdown for region {region} in table {} is registered but not started yet!",
self.table_ident
);
}
}
async fn deregister_region(&self, region: RegionNumber) -> Option<Arc<CountdownTaskHandle>> {
self.countdown_task_handles
.lock()
.await
.remove(&region)
.map(|x| {
info!(
"Deregister alive countdown for region {region} in table {}",
self.table_ident
);
x
})
}
async fn start(&self) {
let handles = self.countdown_task_handles.lock().await;
for handle in handles.values() {
handle.start(self.heartbeat_interval_millis).await;
}
self.started.store(true, Ordering::Relaxed);
info!(
"Region alive countdowns for regions {:?} in table {} are started!",
handles.keys().copied().collect::<Vec<_>>(),
self.table_ident
);
}
async fn keep_lived(&self, designated_regions: Vec<RegionNumber>, deadline: Instant) {
for region in designated_regions {
if let Some(handle) = self.find_handle(&region).await {
handle.reset_deadline(deadline).await;
}
// Else the region alive keeper might be triggered by lagging messages, we can safely ignore it.
}
}
pub async fn deadline(&self, region: RegionNumber) -> Option<Instant> {
let mut deadline = None;
if let Some(handle) = self.find_handle(&region).await {
let (s, r) = oneshot::channel();
if handle.tx.send(CountdownCommand::Deadline(s)).await.is_ok() {
deadline = r.await.ok()
}
}
deadline
}
}
#[derive(Debug)]
enum CountdownCommand {
Start(u64),
Reset(Instant),
Deadline(oneshot::Sender<Instant>),
}
struct CountdownTaskHandle {
tx: mpsc::Sender<CountdownCommand>,
handler: JoinHandle<()>,
table_ident: TableIdent,
region: RegionNumber,
}
impl CountdownTaskHandle {
/// Creates a new [CountdownTaskHandle] and starts the countdown task.
/// # Params
/// - `on_task_finished`: a callback to be invoked when the task is finished. Note that it will not
/// be invoked if the task is cancelled (by dropping the handle). This is because we want something
/// meaningful to be done when the task is finished, e.g. deregistering the handle from the map,
/// whereas dropping the handle does not necessarily mean the task is finished.
fn new<Fut>(
table_engine: TableEngineRef,
table_ident: TableIdent,
region: RegionNumber,
on_task_finished: impl FnOnce() -> Fut + Send + 'static,
) -> Self
where
Fut: Future<Output = ()> + Send,
{
let (tx, rx) = mpsc::channel(1024);
let mut countdown_task = CountdownTask {
table_engine,
table_ident: table_ident.clone(),
region,
rx,
};
let handler = common_runtime::spawn_bg(async move {
countdown_task.run().await;
on_task_finished().await;
});
Self {
tx,
handler,
table_ident,
region,
}
}
async fn start(&self, heartbeat_interval_millis: u64) {
if let Err(e) = self
.tx
.send(CountdownCommand::Start(heartbeat_interval_millis))
.await
{
warn!(
"Failed to start region alive keeper countdown: {e}. \
Maybe the task is stopped due to the region being closed."
);
}
}
async fn reset_deadline(&self, deadline: Instant) {
if let Err(e) = self.tx.send(CountdownCommand::Reset(deadline)).await {
warn!(
"Failed to reset region alive keeper deadline: {e}. \
Maybe the task is stopped due to the region being closed."
);
}
}
}
impl Drop for CountdownTaskHandle {
fn drop(&mut self) {
debug!(
"Aborting region alive countdown task for region {} in table {}",
self.region, self.table_ident,
);
self.handler.abort();
}
}
struct CountdownTask {
table_engine: TableEngineRef,
table_ident: TableIdent,
region: RegionNumber,
rx: mpsc::Receiver<CountdownCommand>,
}
impl CountdownTask {
async fn run(&mut self) {
// 30 years. See `Instant::far_future`.
let far_future = Instant::now() + Duration::from_secs(86400 * 365 * 30);
// Make sure the alive countdown is not gonna happen before heartbeat task is started (the
// "start countdown" command will be sent from heartbeat task).
let countdown = tokio::time::sleep_until(far_future);
tokio::pin!(countdown);
let region = &self.region;
let table_ident = &self.table_ident;
loop {
tokio::select! {
command = self.rx.recv() => {
match command {
Some(CountdownCommand::Start(heartbeat_interval_millis)) => {
// Set the first deadline to 4 heartbeats from now (roughly 20 seconds if the heartbeat
// interval is the default 5 seconds), to make Datanode and Metasrv more tolerant of
// network or other jitter during startup.
let first_deadline = Instant::now() + Duration::from_millis(heartbeat_interval_millis) * 4;
countdown.set(tokio::time::sleep_until(first_deadline));
},
Some(CountdownCommand::Reset(deadline)) => {
if countdown.deadline() < deadline {
debug!(
"Reset deadline of region {region} of table {table_ident} to approximately {} seconds later",
(deadline - Instant::now()).as_secs_f32(),
);
countdown.set(tokio::time::sleep_until(deadline));
}
// Else the countdown could be either:
// - not started yet;
// - during startup protection;
// - received a lagging heartbeat message.
// All can be safely ignored.
},
None => {
info!(
"The handle of countdown task for region {region} of table {table_ident} \
is dropped, RegionAliveKeeper out."
);
break;
},
Some(CountdownCommand::Deadline(tx)) => {
let _ = tx.send(countdown.deadline());
}
}
}
() = &mut countdown => {
let result = self.close_region().await;
warn!(
"Region {region} of table {table_ident} is closed, result: {result:?}. \
RegionAliveKeeper out.",
);
break;
}
}
}
}
async fn close_region(&self) -> CloseTableResult {
let ctx = EngineContext::default();
let region = self.region;
let table_ident = &self.table_ident;
loop {
let request = CloseTableRequest {
catalog_name: table_ident.catalog.clone(),
schema_name: table_ident.schema.clone(),
table_name: table_ident.table.clone(),
table_id: table_ident.table_id,
region_numbers: vec![region],
flush: true,
};
match self.table_engine.close_table(&ctx, request).await {
Ok(result) => return result,
// If the region fails to close, retry immediately. Maybe we should panic instead?
Err(e) => error!(e;
"Failed to close region {region} of table {table_ident}. \
For the integrity of data, retrying immediately without waiting.",
),
}
}
}
}
#[cfg(test)]
mod test {
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use api::v1::meta::{HeartbeatResponse, RegionLease};
use common_meta::heartbeat::mailbox::HeartbeatMailbox;
use datatypes::schema::RawSchema;
use table::engine::manager::MemoryTableEngineManager;
use table::engine::TableEngine;
use table::requests::{CreateTableRequest, TableOptions};
use table::test_util::EmptyTable;
use super::*;
use crate::remote::mock::MockTableEngine;
async fn prepare_keepers() -> (TableIdent, RegionAliveKeepers) {
let table_engine = Arc::new(MockTableEngine::default());
let table_engine_manager = Arc::new(MemoryTableEngineManager::new(table_engine));
let keepers = RegionAliveKeepers::new(table_engine_manager, 5000);
let catalog = "my_catalog";
let schema = "my_schema";
let table = "my_table";
let table_ident = TableIdent {
catalog: catalog.to_string(),
schema: schema.to_string(),
table: table.to_string(),
table_id: 1,
engine: "MockTableEngine".to_string(),
};
let table = Arc::new(EmptyTable::new(CreateTableRequest {
id: 1,
catalog_name: catalog.to_string(),
schema_name: schema.to_string(),
table_name: table.to_string(),
desc: None,
schema: RawSchema {
column_schemas: vec![],
timestamp_index: None,
version: 0,
},
region_numbers: vec![1, 2, 3],
primary_key_indices: vec![],
create_if_not_exists: false,
table_options: TableOptions::default(),
engine: "MockTableEngine".to_string(),
}));
keepers
.register_table(table_ident.clone(), table)
.await
.unwrap();
assert!(keepers.keepers.lock().await.contains_key(&table_ident));
(table_ident, keepers)
}
#[tokio::test(flavor = "multi_thread")]
async fn test_handle_heartbeat_response() {
let (table_ident, keepers) = prepare_keepers().await;
keepers.start().await;
let startup_protection_until = Instant::now() + Duration::from_secs(21);
let duration_since_epoch = (Instant::now() - keepers.epoch).as_millis() as _;
let lease_seconds = 100;
let response = HeartbeatResponse {
region_leases: vec![RegionLease {
table_ident: Some(table_ident.clone().into()),
regions: vec![1, 3], // Not extending region 2's lease time.
duration_since_epoch,
lease_seconds,
}],
..Default::default()
};
let keep_alive_until = keepers.epoch
+ Duration::from_millis(duration_since_epoch)
+ Duration::from_secs(lease_seconds);
let (tx, _) = mpsc::channel(8);
let mailbox = Arc::new(HeartbeatMailbox::new(tx));
let mut ctx = HeartbeatResponseHandlerContext::new(mailbox, response);
assert!(keepers.handle(&mut ctx).await.unwrap() == HandleControl::Continue);
// sleep to wait for background task spawned in `handle`
tokio::time::sleep(Duration::from_secs(1)).await;
async fn test(
keeper: &Arc<RegionAliveKeeper>,
region_number: RegionNumber,
startup_protection_until: Instant,
keep_alive_until: Instant,
is_kept_live: bool,
) {
let deadline = keeper.deadline(region_number).await.unwrap();
if is_kept_live {
assert!(deadline > startup_protection_until && deadline == keep_alive_until);
} else {
assert!(deadline <= startup_protection_until);
}
}
let keeper = &keepers
.keepers
.lock()
.await
.get(&table_ident)
.cloned()
.unwrap();
// Test that regions 1 and 3 are kept alive: their deadlines are updated to the desired instant.
test(keeper, 1, startup_protection_until, keep_alive_until, true).await;
test(keeper, 3, startup_protection_until, keep_alive_until, true).await;
// Test that region 2 is not kept alive: its deadline is not updated and is still within the startup protection period.
test(keeper, 2, startup_protection_until, keep_alive_until, false).await;
}
#[tokio::test(flavor = "multi_thread")]
async fn test_region_alive_keepers() {
let (table_ident, keepers) = prepare_keepers().await;
keepers
.register_region(&RegionIdent {
cluster_id: 1,
datanode_id: 1,
table_ident: table_ident.clone(),
region_number: 4,
})
.await;
keepers.start().await;
for keeper in keepers.keepers.lock().await.values() {
let regions = {
let handles = keeper.countdown_task_handles.lock().await;
handles.keys().copied().collect::<Vec<_>>()
};
for region in regions {
// assert countdown tasks are started
let deadline = keeper.deadline(region).await.unwrap();
assert!(deadline <= Instant::now() + Duration::from_secs(20));
}
}
keepers
.deregister_region(&RegionIdent {
cluster_id: 1,
datanode_id: 1,
table_ident: table_ident.clone(),
region_number: 1,
})
.await;
let mut regions = keepers
.find_keeper(&table_ident)
.await
.unwrap()
.countdown_task_handles
.lock()
.await
.keys()
.copied()
.collect::<Vec<_>>();
regions.sort();
assert_eq!(regions, vec![2, 3, 4]);
let keeper = keepers.deregister_table(&table_ident).await.unwrap();
assert!(Arc::try_unwrap(keeper).is_ok(), "keeper is not dropped");
assert!(keepers.keepers.lock().await.is_empty());
}
#[tokio::test(flavor = "multi_thread")]
async fn test_region_alive_keeper() {
let table_engine = Arc::new(MockTableEngine::default());
let table_ident = TableIdent {
catalog: "my_catalog".to_string(),
schema: "my_schema".to_string(),
table: "my_table".to_string(),
table_id: 1024,
engine: "mito".to_string(),
};
let keeper = RegionAliveKeeper::new(table_engine, table_ident, 1000);
let region = 1;
assert!(keeper.find_handle(&region).await.is_none());
keeper.register_region(region).await;
assert!(keeper.find_handle(&region).await.is_some());
let ten_seconds_later = || Instant::now() + Duration::from_secs(10);
keeper.keep_lived(vec![1, 2, 3], ten_seconds_later()).await;
assert!(keeper.find_handle(&2).await.is_none());
assert!(keeper.find_handle(&3).await.is_none());
let far_future = Instant::now() + Duration::from_secs(86400 * 365 * 29);
// assert if keeper is not started, keep_lived is of no use
assert!(keeper.deadline(region).await.unwrap() > far_future);
keeper.start().await;
keeper.keep_lived(vec![1, 2, 3], ten_seconds_later()).await;
// assert keep_lived works if keeper is started
assert!(keeper.deadline(region).await.unwrap() <= ten_seconds_later());
let handle = keeper.deregister_region(region).await.unwrap();
assert!(Arc::try_unwrap(handle).is_ok(), "handle is not dropped");
assert!(keeper.find_handle(&region).await.is_none());
}
#[tokio::test(flavor = "multi_thread")]
async fn test_countdown_task_handle() {
let table_engine = Arc::new(MockTableEngine::default());
let table_ident = TableIdent {
catalog: "my_catalog".to_string(),
schema: "my_schema".to_string(),
table: "my_table".to_string(),
table_id: 1024,
engine: "mito".to_string(),
};
let finished = Arc::new(AtomicBool::new(false));
let finished_clone = finished.clone();
let handle = CountdownTaskHandle::new(
table_engine.clone(),
table_ident.clone(),
1,
|| async move { finished_clone.store(true, Ordering::Relaxed) },
);
let tx = handle.tx.clone();
// assert countdown task is running
assert!(tx.send(CountdownCommand::Start(5000)).await.is_ok());
assert!(!finished.load(Ordering::Relaxed));
drop(handle);
tokio::time::sleep(Duration::from_secs(1)).await;
// assert countdown task is stopped
assert!(tx
.try_send(CountdownCommand::Reset(
Instant::now() + Duration::from_secs(10)
))
.is_err());
// assert `on_task_finished` is not called (because the task is aborted by the handle's drop)
assert!(!finished.load(Ordering::Relaxed));
let finished = Arc::new(AtomicBool::new(false));
let finished_clone = finished.clone();
let handle = CountdownTaskHandle::new(table_engine, table_ident, 1, || async move {
finished_clone.store(true, Ordering::Relaxed)
});
handle.tx.send(CountdownCommand::Start(100)).await.unwrap();
tokio::time::sleep(Duration::from_secs(1)).await;
// assert `on_task_finished` is called when task is finished normally
assert!(finished.load(Ordering::Relaxed));
}
#[tokio::test(flavor = "multi_thread")]
async fn test_countdown_task_run() {
let ctx = &EngineContext::default();
let catalog = "my_catalog";
let schema = "my_schema";
let table = "my_table";
let table_id = 1;
let request = CreateTableRequest {
id: table_id,
catalog_name: catalog.to_string(),
schema_name: schema.to_string(),
table_name: table.to_string(),
desc: None,
schema: RawSchema {
column_schemas: vec![],
timestamp_index: None,
version: 0,
},
region_numbers: vec![],
primary_key_indices: vec![],
create_if_not_exists: false,
table_options: TableOptions::default(),
engine: "mito".to_string(),
};
let table_engine = Arc::new(MockTableEngine::default());
table_engine.create_table(ctx, request).await.unwrap();
let table_ident = TableIdent {
catalog: catalog.to_string(),
schema: schema.to_string(),
table: table.to_string(),
table_id,
engine: "mito".to_string(),
};
let (tx, rx) = mpsc::channel(10);
let mut task = CountdownTask {
table_engine: table_engine.clone(),
table_ident,
region: 1,
rx,
};
common_runtime::spawn_bg(async move {
task.run().await;
});
async fn deadline(tx: &mpsc::Sender<CountdownCommand>) -> Instant {
let (s, r) = oneshot::channel();
tx.send(CountdownCommand::Deadline(s)).await.unwrap();
r.await.unwrap()
}
// if countdown task is not started, its deadline is set to far future
assert!(deadline(&tx).await > Instant::now() + Duration::from_secs(86400 * 365 * 29));
// start countdown in 250ms * 4 = 1s
tx.send(CountdownCommand::Start(250)).await.unwrap();
// assert deadline is correctly set
assert!(deadline(&tx).await <= Instant::now() + Duration::from_secs(1));
// reset countdown in 1.5s
tx.send(CountdownCommand::Reset(
Instant::now() + Duration::from_millis(1500),
))
.await
.unwrap();
// assert the table is closed after deadline is reached
assert!(table_engine.table_exists(ctx, table_id));
// spare 500ms for the task to close the table
tokio::time::sleep(Duration::from_millis(2000)).await;
assert!(!table_engine.table_exists(ctx, table_id));
}
}
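
The deadline arithmetic in the file above reduces to three rules: a lease deadline is the keeper's epoch plus the reported offset plus the lease length; the first deadline after start is four heartbeat intervals away; and a reset only ever moves the deadline forward. The sketch below restates those rules with `std::time` instead of `tokio::time`. The function names are hypothetical and do not appear in the diff.

```rust
use std::time::{Duration, Instant};

/// Deadline implied by a lease: the local epoch, plus the offset the
/// heartbeat carried (milliseconds since epoch), plus the lease length.
pub fn lease_deadline(
    epoch: Instant,
    duration_since_epoch_millis: u64,
    lease_seconds: u64,
) -> Instant {
    epoch
        + Duration::from_millis(duration_since_epoch_millis)
        + Duration::from_secs(lease_seconds)
}

/// Startup protection: the first deadline is four heartbeat intervals
/// after `now`.
pub fn first_deadline(now: Instant, heartbeat_interval_millis: u64) -> Instant {
    now + Duration::from_millis(heartbeat_interval_millis) * 4
}

/// Reset rule: accept the requested deadline only if it is later than the
/// current one; earlier or equal requests are ignored.
pub fn apply_reset(current: Instant, requested: Instant) -> Instant {
    if current < requested {
        requested
    } else {
        current
    }
}
```

Under the forward-only rule, a reset that arrives before the countdown is started is also ignored, because the placeholder deadline is far in the future. That is the behavior `test_region_alive_keeper` asserts above.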


@@ -19,6 +19,7 @@ use std::sync::Arc;
use async_trait::async_trait;
use common_catalog::consts::{INFORMATION_SCHEMA_NAME, SYSTEM_CATALOG_TABLE_NAME};
use common_telemetry::logging;
use snafu::ResultExt;
use table::metadata::TableId;
use table::{Table, TableRef};
@@ -91,12 +92,21 @@ impl SystemCatalog {
&self,
request: &DeregisterTableRequest,
table_id: TableId,
) -> CatalogResult<bool> {
) -> CatalogResult<()> {
self.information_schema
.system
.delete(build_table_deletion_request(request, table_id))
.await
.map(|x| x == 1)
.map(|x| {
if x != 1 {
let table = common_catalog::format_full_table_name(
&request.catalog,
&request.schema,
&request.table_name
);
logging::warn!("Failed to delete table record from information_schema, unexpected returned result: {x}, table: {table}");
}
})
.with_context(|_| error::DeregisterTableSnafu {
request: request.clone(),
})
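
The hunk above changes the return type from `CatalogResult<bool>` to `CatalogResult<()>`: an unexpected deleted-row count is now only logged, while real errors still propagate. A minimal sketch of that shape follows, assuming a hypothetical `check_deleted` helper, `String` as the error type, and `eprintln!` in place of the project's logging macro.

```rust
/// Hypothetical helper: `rows` is the outcome of a delete call, carrying the
/// number of rows removed on success.
pub fn check_deleted(rows: Result<usize, String>, table: &str) -> Result<(), String> {
    rows.map(|x| {
        if x != 1 {
            // Not an error for the caller; just surface the anomaly.
            eprintln!("unexpected deleted row count {x} for table {table}");
        }
    })
}
```

With this shape, callers can no longer tell "deleted one row" apart from "deleted none"; only a failure of the delete call itself surfaces as `Err`.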


@@ -19,20 +19,38 @@ mod tests {
use std::assert_matches::assert_matches;
use std::collections::HashSet;
use std::sync::Arc;
use std::time::Duration;
use catalog::helper::{CatalogKey, CatalogValue, SchemaKey, SchemaValue};
use catalog::remote::mock::{MockKvBackend, MockTableEngine};
use catalog::remote::region_alive_keeper::RegionAliveKeepers;
use catalog::remote::{
CachedMetaKvBackend, KvBackend, KvBackendRef, RemoteCatalogManager, RemoteCatalogProvider,
RemoteSchemaProvider,
};
use catalog::{CatalogManager, RegisterTableRequest};
use common_catalog::consts::{DEFAULT_CATALOG_NAME, DEFAULT_SCHEMA_NAME, MITO_ENGINE};
use common_meta::ident::TableIdent;
use datatypes::schema::RawSchema;
use futures_util::StreamExt;
use table::engine::manager::{MemoryTableEngineManager, TableEngineManagerRef};
use table::engine::{EngineContext, TableEngineRef};
use table::requests::CreateTableRequest;
use table::test_util::EmptyTable;
use tokio::time::Instant;
struct TestingComponents {
kv_backend: KvBackendRef,
catalog_manager: Arc<RemoteCatalogManager>,
table_engine_manager: TableEngineManagerRef,
region_alive_keepers: Arc<RegionAliveKeepers>,
}
impl TestingComponents {
fn table_engine(&self) -> TableEngineRef {
self.table_engine_manager.engine(MITO_ENGINE).unwrap()
}
}
#[tokio::test]
async fn test_backend() {
@@ -120,14 +138,7 @@ mod tests {
assert!(ret.is_none());
}
async fn prepare_components(
node_id: u64,
) -> (
KvBackendRef,
TableEngineRef,
Arc<RemoteCatalogManager>,
TableEngineManagerRef,
) {
async fn prepare_components(node_id: u64) -> TestingComponents {
let cached_backend = Arc::new(CachedMetaKvBackend::wrap(
Arc::new(MockKvBackend::default()),
));
@@ -135,26 +146,34 @@ mod tests {
let table_engine = Arc::new(MockTableEngine::default());
let engine_manager = Arc::new(MemoryTableEngineManager::alias(
MITO_ENGINE.to_string(),
table_engine.clone(),
table_engine,
));
let catalog_manager =
RemoteCatalogManager::new(engine_manager.clone(), node_id, cached_backend.clone());
let region_alive_keepers = Arc::new(RegionAliveKeepers::new(engine_manager.clone(), 5000));
let catalog_manager = RemoteCatalogManager::new(
engine_manager.clone(),
node_id,
cached_backend.clone(),
region_alive_keepers.clone(),
);
catalog_manager.start().await.unwrap();
(
cached_backend,
table_engine,
Arc::new(catalog_manager),
engine_manager as Arc<_>,
)
TestingComponents {
kv_backend: cached_backend,
catalog_manager: Arc::new(catalog_manager),
table_engine_manager: engine_manager,
region_alive_keepers,
}
}
#[tokio::test]
async fn test_remote_catalog_default() {
common_telemetry::init_default_ut_logging();
let node_id = 42;
let (_, _, catalog_manager, _) = prepare_components(node_id).await;
let TestingComponents {
catalog_manager, ..
} = prepare_components(node_id).await;
assert_eq!(
vec![DEFAULT_CATALOG_NAME.to_string()],
catalog_manager.catalog_names().await.unwrap()
@@ -175,14 +194,16 @@ mod tests {
async fn test_remote_catalog_register_nonexistent() {
common_telemetry::init_default_ut_logging();
let node_id = 42;
let (_, table_engine, catalog_manager, _) = prepare_components(node_id).await;
let components = prepare_components(node_id).await;
// register a new table with a nonexistent catalog
let catalog_name = "nonexistent_catalog".to_string();
let schema_name = "nonexistent_schema".to_string();
let table_name = "fail_table".to_string();
// this schema has no effect
let table_schema = RawSchema::new(vec![]);
let table = table_engine
let table = components
.table_engine()
.create_table(
&EngineContext {},
CreateTableRequest {
@@ -208,7 +229,7 @@ mod tests {
table_id: 1,
table,
};
let res = catalog_manager.register_table(reg_req).await;
let res = components.catalog_manager.register_table(reg_req).await;
// because nonexistent_catalog does not exist yet.
assert_matches!(
@@ -220,7 +241,8 @@ mod tests {
#[tokio::test]
async fn test_register_table() {
let node_id = 42;
let (_, table_engine, catalog_manager, _) = prepare_components(node_id).await;
let components = prepare_components(node_id).await;
let catalog_manager = &components.catalog_manager;
let default_catalog = catalog_manager
.catalog(DEFAULT_CATALOG_NAME)
.await
@@ -244,7 +266,8 @@ mod tests {
let table_id = 1;
// this schema has no effect
let table_schema = RawSchema::new(vec![]);
let table = table_engine
let table = components
.table_engine()
.create_table(
&EngineContext {},
CreateTableRequest {
@@ -280,8 +303,10 @@ mod tests {
#[tokio::test]
async fn test_register_catalog_schema_table() {
let node_id = 42;
let (backend, table_engine, catalog_manager, engine_manager) =
prepare_components(node_id).await;
let components = prepare_components(node_id).await;
let backend = &components.kv_backend;
let catalog_manager = components.catalog_manager.clone();
let engine_manager = components.table_engine_manager.clone();
let catalog_name = "test_catalog".to_string();
let schema_name = "nonexistent_schema".to_string();
@@ -290,6 +315,7 @@ mod tests {
backend.clone(),
engine_manager.clone(),
node_id,
components.region_alive_keepers.clone(),
));
// register catalog to catalog manager
@@ -303,7 +329,8 @@ mod tests {
HashSet::from_iter(catalog_manager.catalog_names().await.unwrap().into_iter())
);
let table_to_register = table_engine
let table_to_register = components
.table_engine()
.create_table(
&EngineContext {},
CreateTableRequest {
@@ -350,6 +377,7 @@ mod tests {
node_id,
engine_manager,
backend.clone(),
components.region_alive_keepers.clone(),
));
let prev = new_catalog
@@ -369,4 +397,94 @@ mod tests {
.collect()
)
}
#[tokio::test]
async fn test_register_table_before_and_after_region_alive_keeper_started() {
let components = prepare_components(42).await;
let catalog_manager = &components.catalog_manager;
let region_alive_keepers = &components.region_alive_keepers;
let table_before = TableIdent {
catalog: DEFAULT_CATALOG_NAME.to_string(),
schema: DEFAULT_SCHEMA_NAME.to_string(),
table: "table_before".to_string(),
table_id: 1,
engine: MITO_ENGINE.to_string(),
};
let request = RegisterTableRequest {
catalog: table_before.catalog.clone(),
schema: table_before.schema.clone(),
table_name: table_before.table.clone(),
table_id: table_before.table_id,
table: Arc::new(EmptyTable::new(CreateTableRequest {
id: table_before.table_id,
catalog_name: table_before.catalog.clone(),
schema_name: table_before.schema.clone(),
table_name: table_before.table.clone(),
desc: None,
schema: RawSchema::new(vec![]),
region_numbers: vec![0],
primary_key_indices: vec![],
create_if_not_exists: false,
table_options: Default::default(),
engine: MITO_ENGINE.to_string(),
})),
};
assert!(catalog_manager.register_table(request).await.unwrap());
let keeper = region_alive_keepers
.find_keeper(&table_before)
.await
.unwrap();
let deadline = keeper.deadline(0).await.unwrap();
let far_future = Instant::now() + Duration::from_secs(86400 * 365 * 29);
// assert region alive countdown is not started
assert!(deadline > far_future);
region_alive_keepers.start().await;
let table_after = TableIdent {
catalog: DEFAULT_CATALOG_NAME.to_string(),
schema: DEFAULT_SCHEMA_NAME.to_string(),
table: "table_after".to_string(),
table_id: 2,
engine: MITO_ENGINE.to_string(),
};
let request = RegisterTableRequest {
catalog: table_after.catalog.clone(),
schema: table_after.schema.clone(),
table_name: table_after.table.clone(),
table_id: table_after.table_id,
table: Arc::new(EmptyTable::new(CreateTableRequest {
id: table_after.table_id,
catalog_name: table_after.catalog.clone(),
schema_name: table_after.schema.clone(),
table_name: table_after.table.clone(),
desc: None,
schema: RawSchema::new(vec![]),
region_numbers: vec![0],
primary_key_indices: vec![],
create_if_not_exists: false,
table_options: Default::default(),
engine: MITO_ENGINE.to_string(),
})),
};
assert!(catalog_manager.register_table(request).await.unwrap());
let keeper = region_alive_keepers
.find_keeper(&table_after)
.await
.unwrap();
let deadline = keeper.deadline(0).await.unwrap();
// assert countdown is started for the table registered after [RegionAliveKeepers] started
assert!(deadline <= Instant::now() + Duration::from_secs(20));
let keeper = region_alive_keepers
.find_keeper(&table_before)
.await
.unwrap();
let deadline = keeper.deadline(0).await.unwrap();
// assert countdown is started for the table registered before [RegionAliveKeepers] started, too
assert!(deadline <= Instant::now() + Duration::from_secs(20));
}
}

View File

@@ -52,4 +52,4 @@ serde.workspace = true
toml = "0.5"
[build-dependencies]
build-data = "0.1.3"
build-data = "0.1.4"

View File

@@ -93,6 +93,8 @@ struct StartCommand {
#[clap(long)]
use_memory_store: bool,
#[clap(long)]
disable_region_failover: bool,
#[clap(long)]
http_addr: Option<String>,
#[clap(long)]
http_timeout: Option<u64>,
@@ -134,9 +136,9 @@ impl StartCommand {
.context(error::UnsupportedSelectorTypeSnafu { selector_type })?;
}
if self.use_memory_store {
opts.use_memory_store = true;
}
opts.use_memory_store = self.use_memory_store;
opts.disable_region_failover = self.disable_region_failover;
if let Some(http_addr) = &self.http_addr {
opts.http_opts.addr = http_addr.clone();

View File

@@ -24,6 +24,7 @@ datafusion.workspace = true
derive_builder = "0.12"
futures.workspace = true
object-store = { path = "../../object-store" }
orc-rust = "0.2.3"
regex = "1.7"
snafu.workspace = true
tokio.workspace = true

View File

@@ -12,24 +12,26 @@
// See the License for the specific language governing permissions and
// limitations under the License.
use std::future::Future;
use arrow::record_batch::RecordBatch;
use async_trait::async_trait;
use datafusion::parquet::format::FileMetaData;
use object_store::Writer;
use snafu::{OptionExt, ResultExt};
use tokio::io::{AsyncWrite, AsyncWriteExt};
use tokio_util::compat::Compat;
use crate::error::{self, Result};
use crate::share_buffer::SharedBuffer;
pub struct BufferedWriter<T, U> {
writer: T,
/// None stands for [`BufferedWriter`] closed.
pub struct LazyBufferedWriter<T, U, F> {
path: String,
writer_factory: F,
writer: Option<T>,
/// None stands for [`LazyBufferedWriter`] closed.
encoder: Option<U>,
buffer: SharedBuffer,
rows_written: usize,
bytes_written: u64,
flushed: bool,
threshold: usize,
}
@@ -42,58 +44,79 @@ pub trait ArrowWriterCloser {
async fn close(mut self) -> Result<FileMetaData>;
}
pub type DefaultBufferedWriter<E> = BufferedWriter<Compat<Writer>, E>;
impl<T: AsyncWrite + Send + Unpin, U: DfRecordBatchEncoder + ArrowWriterCloser>
BufferedWriter<T, U>
impl<
T: AsyncWrite + Send + Unpin,
U: DfRecordBatchEncoder + ArrowWriterCloser,
F: FnMut(String) -> Fut,
Fut: Future<Output = Result<T>>,
> LazyBufferedWriter<T, U, F>
{
/// Closes the `LazyBufferedWriter`, flushing all data to the underlying storage
/// only if any rows have been written.
pub async fn close_with_arrow_writer(mut self) -> Result<(FileMetaData, u64)> {
let encoder = self
.encoder
.take()
.context(error::BufferedWriterClosedSnafu)?;
let metadata = encoder.close().await?;
let written = self.try_flush(true).await?;
// `rows_written` tracks whether any rows have been written.
// If none have, close the underlying writer without flushing
// so that no file is actually created.
if self.rows_written != 0 {
self.bytes_written += self.try_flush(true).await?;
}
// Shutting down is important: it flushes all pending writes.
self.close().await?;
Ok((metadata, written))
self.close_inner_writer().await?;
Ok((metadata, self.bytes_written))
}
}
impl<T: AsyncWrite + Send + Unpin, U: DfRecordBatchEncoder> BufferedWriter<T, U> {
pub async fn close(&mut self) -> Result<()> {
self.writer.shutdown().await.context(error::AsyncWriteSnafu)
impl<
T: AsyncWrite + Send + Unpin,
U: DfRecordBatchEncoder,
F: FnMut(String) -> Fut,
Fut: Future<Output = Result<T>>,
> LazyBufferedWriter<T, U, F>
{
/// Closes the writer without flushing the buffer data.
pub async fn close_inner_writer(&mut self) -> Result<()> {
if let Some(writer) = &mut self.writer {
writer.shutdown().await.context(error::AsyncWriteSnafu)?;
}
Ok(())
}
pub fn new(threshold: usize, buffer: SharedBuffer, encoder: U, writer: T) -> Self {
pub fn new(
threshold: usize,
buffer: SharedBuffer,
encoder: U,
path: impl AsRef<str>,
writer_factory: F,
) -> Self {
Self {
path: path.as_ref().to_string(),
threshold,
writer,
encoder: Some(encoder),
buffer,
rows_written: 0,
bytes_written: 0,
flushed: false,
writer_factory,
writer: None,
}
}
pub fn bytes_written(&self) -> u64 {
self.bytes_written
}
pub async fn write(&mut self, batch: &RecordBatch) -> Result<()> {
let encoder = self
.encoder
.as_mut()
.context(error::BufferedWriterClosedSnafu)?;
encoder.write(batch)?;
self.try_flush(false).await?;
self.rows_written += batch.num_rows();
self.bytes_written += self.try_flush(false).await?;
Ok(())
}
pub fn flushed(&self) -> bool {
self.flushed
}
pub async fn try_flush(&mut self, all: bool) -> Result<u64> {
let mut bytes_written: u64 = 0;
@@ -106,7 +129,8 @@ impl<T: AsyncWrite + Send + Unpin, U: DfRecordBatchEncoder> BufferedWriter<T, U>
};
let size = chunk.len();
self.writer
self.maybe_init_writer()
.await?
.write_all(&chunk)
.await
.context(error::AsyncWriteSnafu)?;
@@ -117,22 +141,27 @@ impl<T: AsyncWrite + Send + Unpin, U: DfRecordBatchEncoder> BufferedWriter<T, U>
if all {
bytes_written += self.try_flush_all().await?;
}
self.flushed = bytes_written > 0;
self.bytes_written += bytes_written;
Ok(bytes_written)
}
/// Only initializes the underlying file writer when rows have been written.
async fn maybe_init_writer(&mut self) -> Result<&mut T> {
if let Some(ref mut writer) = self.writer {
Ok(writer)
} else {
let writer = (self.writer_factory)(self.path.clone()).await?;
Ok(self.writer.insert(writer))
}
}
async fn try_flush_all(&mut self) -> Result<u64> {
let remain = self.buffer.buffer.lock().unwrap().split();
let size = remain.len();
self.writer
self.maybe_init_writer()
.await?
.write_all(&remain)
.await
.context(error::AsyncWriteSnafu)?;
Ok(size as u64)
}
}
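
Note on the hunk above: the lazy-initialization idea behind `maybe_init_writer` can be illustrated without the async machinery. The following is a minimal synchronous sketch, assuming a plain `std::io::Write` sink. `LazySink`, `maybe_init`, and the factory signature are hypothetical names for illustration and are not part of the patch.

```rust
use std::io::{self, Write};

/// Hypothetical synchronous analogue of `LazyBufferedWriter`: the sink is
/// created by `factory` only when the first bytes are actually written.
pub struct LazySink<W, F> {
    path: String,
    factory: F,
    sink: Option<W>,
    bytes_written: u64,
}

impl<W: Write, F: FnMut(&str) -> io::Result<W>> LazySink<W, F> {
    pub fn new(path: impl Into<String>, factory: F) -> Self {
        Self {
            path: path.into(),
            factory,
            sink: None,
            bytes_written: 0,
        }
    }

    /// Returns the sink, calling the factory on first use only.
    fn maybe_init(&mut self) -> io::Result<&mut W> {
        if self.sink.is_none() {
            let sink = (self.factory)(self.path.as_str())?;
            self.sink = Some(sink);
        }
        Ok(self.sink.as_mut().expect("sink was initialized above"))
    }

    /// Writes `data`; empty input never triggers sink creation.
    pub fn write(&mut self, data: &[u8]) -> io::Result<()> {
        if data.is_empty() {
            return Ok(());
        }
        self.maybe_init()?.write_all(data)?;
        self.bytes_written += data.len() as u64;
        Ok(())
    }

    /// Flushes the sink if it exists and returns it with the byte count.
    /// If nothing was written, the sink is `None` and no file was opened.
    pub fn close(mut self) -> io::Result<(Option<W>, u64)> {
        if let Some(sink) = self.sink.as_mut() {
            sink.flush()?;
        }
        Ok((self.sink.take(), self.bytes_written))
    }
}
```

With this shape, closing a sink that never received data does not call the factory, which is the property the patch relies on so that no file is created when no rows are written.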

View File

@@ -54,6 +54,12 @@ pub enum Error {
location: Location,
},
#[snafu(display("Failed to build orc reader, source: {}", source))]
OrcReader {
location: Location,
source: orc_rust::error::Error,
},
#[snafu(display("Failed to read object from path: {}, source: {}", path, source))]
ReadObject {
path: String,
@@ -171,7 +177,8 @@ impl ErrorExt for Error {
| ReadRecordBatch { .. }
| WriteRecordBatch { .. }
| EncodeRecordBatch { .. }
| BufferedWriterClosed { .. } => StatusCode::Unexpected,
| BufferedWriterClosed { .. }
| OrcReader { .. } => StatusCode::Unexpected,
}
}
@@ -182,6 +189,7 @@ impl ErrorExt for Error {
fn location_opt(&self) -> Option<common_error::snafu::Location> {
use Error::*;
match self {
OrcReader { location, .. } => Some(*location),
BuildBackend { location, .. } => Some(*location),
ReadObject { location, .. } => Some(*location),
ListObjects { location, .. } => Some(*location),

View File

@@ -14,6 +14,7 @@
pub mod csv;
pub mod json;
pub mod orc;
pub mod parquet;
#[cfg(test)]
pub mod tests;
@@ -35,12 +36,12 @@ use datafusion::physical_plan::SendableRecordBatchStream;
use futures::StreamExt;
use object_store::ObjectStore;
use snafu::ResultExt;
use tokio_util::compat::FuturesAsyncWriteCompatExt;
use self::csv::CsvFormat;
use self::json::JsonFormat;
use self::orc::OrcFormat;
use self::parquet::ParquetFormat;
use crate::buffered_writer::{BufferedWriter, DfRecordBatchEncoder};
use crate::buffered_writer::{DfRecordBatchEncoder, LazyBufferedWriter};
use crate::compression::CompressionType;
use crate::error::{self, Result};
use crate::share_buffer::SharedBuffer;
@@ -57,6 +58,18 @@ pub enum Format {
Csv(CsvFormat),
Json(JsonFormat),
Parquet(ParquetFormat),
Orc(OrcFormat),
}
impl Format {
pub fn suffix(&self) -> &'static str {
match self {
Format::Csv(_) => ".csv",
Format::Json(_) => ".json",
Format::Parquet(_) => ".parquet",
Format::Orc(_) => ".orc",
}
}
}
impl TryFrom<&HashMap<String, String>> for Format {
@@ -72,6 +85,7 @@ impl TryFrom<&HashMap<String, String>> for Format {
"CSV" => Ok(Self::Csv(CsvFormat::try_from(options)?)),
"JSON" => Ok(Self::Json(JsonFormat::try_from(options)?)),
"PARQUET" => Ok(Self::Parquet(ParquetFormat::default())),
"ORC" => Ok(Self::Orc(OrcFormat)),
_ => error::UnsupportedFormatSnafu { format: &format }.fail(),
}
}
@@ -181,15 +195,14 @@ pub async fn stream_to_file<T: DfRecordBatchEncoder, U: Fn(SharedBuffer) -> T>(
threshold: usize,
encoder_factory: U,
) -> Result<usize> {
let writer = store
.writer(path)
.await
.context(error::WriteObjectSnafu { path })?
.compat_write();
let buffer = SharedBuffer::with_capacity(threshold);
let encoder = encoder_factory(buffer.clone());
let mut writer = BufferedWriter::new(threshold, buffer, encoder, writer);
let mut writer = LazyBufferedWriter::new(threshold, buffer, encoder, path, |path| async {
store
.writer(&path)
.await
.context(error::WriteObjectSnafu { path })
});
let mut rows = 0;
@@ -201,8 +214,7 @@ pub async fn stream_to_file<T: DfRecordBatchEncoder, U: Fn(SharedBuffer) -> T>(
// Flushes all pending writes
writer.try_flush(true).await?;
writer.close().await?;
writer.close_inner_writer().await?;
Ok(rows)
}

View File

@@ -0,0 +1,102 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
use std::pin::Pin;
use std::task::{Context, Poll};
use arrow_schema::{Schema, SchemaRef};
use async_trait::async_trait;
use datafusion::arrow::record_batch::RecordBatch as DfRecordBatch;
use datafusion::error::{DataFusionError, Result as DfResult};
use datafusion::physical_plan::RecordBatchStream;
use futures::Stream;
use object_store::ObjectStore;
use orc_rust::arrow_reader::{create_arrow_schema, Cursor};
use orc_rust::async_arrow_reader::ArrowStreamReader;
pub use orc_rust::error::Error as OrcError;
use orc_rust::reader::Reader;
use snafu::ResultExt;
use tokio::io::{AsyncRead, AsyncSeek};
use crate::error::{self, Result};
use crate::file_format::FileFormat;
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct OrcFormat;
pub async fn new_orc_cursor<R: AsyncRead + AsyncSeek + Unpin + Send + 'static>(
reader: R,
) -> Result<Cursor<R>> {
let reader = Reader::new_async(reader)
.await
.context(error::OrcReaderSnafu)?;
let cursor = Cursor::root(reader).context(error::OrcReaderSnafu)?;
Ok(cursor)
}
pub async fn new_orc_stream_reader<R: AsyncRead + AsyncSeek + Unpin + Send + 'static>(
reader: R,
) -> Result<ArrowStreamReader<R>> {
let cursor = new_orc_cursor(reader).await?;
Ok(ArrowStreamReader::new(cursor, None))
}
pub async fn infer_orc_schema<R: AsyncRead + AsyncSeek + Unpin + Send + 'static>(
reader: R,
) -> Result<Schema> {
let cursor = new_orc_cursor(reader).await?;
Ok(create_arrow_schema(&cursor))
}
pub struct OrcArrowStreamReaderAdapter<T: AsyncRead + AsyncSeek + Unpin + Send + 'static> {
stream: ArrowStreamReader<T>,
}
impl<T: AsyncRead + AsyncSeek + Unpin + Send + 'static> OrcArrowStreamReaderAdapter<T> {
pub fn new(stream: ArrowStreamReader<T>) -> Self {
Self { stream }
}
}
impl<T: AsyncRead + AsyncSeek + Unpin + Send + 'static> RecordBatchStream
for OrcArrowStreamReaderAdapter<T>
{
fn schema(&self) -> SchemaRef {
self.stream.schema()
}
}
impl<T: AsyncRead + AsyncSeek + Unpin + Send + 'static> Stream for OrcArrowStreamReaderAdapter<T> {
type Item = DfResult<DfRecordBatch>;
fn poll_next(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
let batch = futures::ready!(Pin::new(&mut self.stream).poll_next(cx))
.map(|r| r.map_err(|e| DataFusionError::External(Box::new(e))));
Poll::Ready(batch)
}
}
#[async_trait]
impl FileFormat for OrcFormat {
async fn infer_schema(&self, store: &ObjectStore, path: &str) -> Result<Schema> {
let reader = store
.reader(path)
.await
.context(error::ReadObjectSnafu { path })?;
let schema = infer_orc_schema(reader).await?;
Ok(schema)
}
}

View File

@@ -0,0 +1,11 @@
## Generate orc data
```bash
python3 -m venv venv
venv/bin/pip install -U pip
venv/bin/pip install -U pyorc
./venv/bin/python write.py
cargo test
```

Binary file not shown.

View File

@@ -0,0 +1,103 @@
# Copyright 2023 Greptime Team
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import random
import datetime
import pyorc
data = {
"double_a": [1.0, 2.0, 3.0, 4.0, 5.0],
"a": [1.0, 2.0, None, 4.0, 5.0],
"b": [True, False, None, True, False],
"str_direct": ["a", "cccccc", None, "ddd", "ee"],
"d": ["a", "bb", None, "ccc", "ddd"],
"e": ["ddd", "cc", None, "bb", "a"],
"f": ["aaaaa", "bbbbb", None, "ccccc", "ddddd"],
"int_short_repeated": [5, 5, None, 5, 5],
"int_neg_short_repeated": [-5, -5, None, -5, -5],
"int_delta": [1, 2, None, 4, 5],
"int_neg_delta": [5, 4, None, 2, 1],
"int_direct": [1, 6, None, 3, 2],
"int_neg_direct": [-1, -6, None, -3, -2],
"bigint_direct": [1, 6, None, 3, 2],
"bigint_neg_direct": [-1, -6, None, -3, -2],
"bigint_other": [5, -5, 1, 5, 5],
"utf8_increase": ["a", "bb", "ccc", "dddd", "eeeee"],
"utf8_decrease": ["eeeee", "dddd", "ccc", "bb", "a"],
"timestamp_simple": [datetime.datetime(2023, 4, 1, 20, 15, 30, 2000), datetime.datetime.fromtimestamp(int('1629617204525777000')/1000000000), datetime.datetime(2023, 1, 1), datetime.datetime(2023, 2, 1), datetime.datetime(2023, 3, 1)],
"date_simple": [datetime.date(2023, 4, 1), datetime.date(2023, 3, 1), datetime.date(2023, 1, 1), datetime.date(2023, 2, 1), datetime.date(2023, 3, 1)]
}
def infer_schema(data):
schema = "struct<"
for key, value in data.items():
dt = type(value[0])
if dt == float:
dt = "float"
elif dt == int:
dt = "int"
elif dt == bool:
dt = "boolean"
elif dt == str:
dt = "string"
elif key.startswith("timestamp"):
dt = "timestamp"
elif key.startswith("date"):
dt = "date"
else:
print(key,value,dt)
raise NotImplementedError
if key.startswith("double"):
dt = "double"
if key.startswith("bigint"):
dt = "bigint"
schema += key + ":" + dt + ","
schema = schema[:-1] + ">"
return schema
def _write(
schema: str,
data,
file_name: str,
compression=pyorc.CompressionKind.NONE,
dict_key_size_threshold=0.0,
):
output = open(file_name, "wb")
writer = pyorc.Writer(
output,
schema,
dict_key_size_threshold=dict_key_size_threshold,
# use a small number to ensure that compression crosses value boundaries
compression_block_size=32,
compression=compression,
)
num_rows = len(list(data.values())[0])
for x in range(num_rows):
row = tuple(values[x] for values in data.values())
writer.write(row)
writer.close()
with open(file_name, "rb") as f:
reader = pyorc.Reader(f)
list(reader)
_write(
infer_schema(data),
data,
"test.orc",
)

View File

@@ -16,13 +16,16 @@ use std::fmt;
use std::str::FromStr;
use std::sync::Arc;
use common_query::error::{self, Result, UnsupportedInputDataTypeSnafu};
use common_query::error::{InvalidFuncArgsSnafu, Result, UnsupportedInputDataTypeSnafu};
use common_query::prelude::{Signature, Volatility};
use common_time::timestamp::TimeUnit;
use common_time::Timestamp;
use datatypes::prelude::ConcreteDataType;
use datatypes::types::StringType;
use datatypes::vectors::{Int64Vector, StringVector, Vector, VectorRef};
use datatypes::types::TimestampType;
use datatypes::vectors::{
Int64Vector, StringVector, TimestampMicrosecondVector, TimestampMillisecondVector,
TimestampNanosecondVector, TimestampSecondVector, Vector, VectorRef,
};
use snafu::ensure;
use crate::scalars::function::{Function, FunctionContext};
@@ -42,18 +45,33 @@ fn convert_to_seconds(arg: &str) -> Option<i64> {
}
}
fn process_vector(vector: &dyn Vector) -> Vec<Option<i64>> {
(0..vector.len())
.map(|i| paste::expr!((vector.get(i)).as_timestamp().map(|ts| ts.value())))
.collect::<Vec<Option<i64>>>()
}
impl Function for ToUnixtimeFunction {
fn name(&self) -> &str {
NAME
}
fn return_type(&self, _input_types: &[ConcreteDataType]) -> Result<ConcreteDataType> {
Ok(ConcreteDataType::timestamp_second_datatype())
Ok(ConcreteDataType::int64_datatype())
}
fn signature(&self) -> Signature {
Signature::exact(
vec![ConcreteDataType::String(StringType)],
Signature::uniform(
1,
vec![
ConcreteDataType::string_datatype(),
ConcreteDataType::int32_datatype(),
ConcreteDataType::int64_datatype(),
ConcreteDataType::timestamp_second_datatype(),
ConcreteDataType::timestamp_millisecond_datatype(),
ConcreteDataType::timestamp_microsecond_datatype(),
ConcreteDataType::timestamp_nanosecond_datatype(),
],
Volatility::Immutable,
)
}
@@ -61,7 +79,7 @@ impl Function for ToUnixtimeFunction {
fn eval(&self, _func_ctx: FunctionContext, columns: &[VectorRef]) -> Result<VectorRef> {
ensure!(
columns.len() == 1,
error::InvalidFuncArgsSnafu {
InvalidFuncArgsSnafu {
err_msg: format!(
"The length of the args is not correct, expect exactly one, have: {}",
columns.len()
@@ -79,6 +97,42 @@ impl Function for ToUnixtimeFunction {
.collect::<Vec<_>>(),
)))
}
ConcreteDataType::Int64(_) | ConcreteDataType::Int32(_) => {
let array = columns[0].to_arrow_array();
Ok(Arc::new(Int64Vector::try_from_arrow_array(&array).unwrap()))
}
ConcreteDataType::Timestamp(ts) => {
let array = columns[0].to_arrow_array();
let value = match ts {
TimestampType::Second(_) => {
let vector = paste::expr!(TimestampSecondVector::try_from_arrow_array(
array
)
.unwrap());
process_vector(&vector)
}
TimestampType::Millisecond(_) => {
let vector = paste::expr!(
TimestampMillisecondVector::try_from_arrow_array(array).unwrap()
);
process_vector(&vector)
}
TimestampType::Microsecond(_) => {
let vector = paste::expr!(
TimestampMicrosecondVector::try_from_arrow_array(array).unwrap()
);
process_vector(&vector)
}
TimestampType::Nanosecond(_) => {
let vector = paste::expr!(TimestampNanosecondVector::try_from_arrow_array(
array
)
.unwrap());
process_vector(&vector)
}
};
Ok(Arc::new(Int64Vector::from(value)))
}
_ => UnsupportedInputDataTypeSnafu {
function: NAME,
datatypes: columns.iter().map(|c| c.data_type()).collect::<Vec<_>>(),
@@ -97,28 +151,37 @@ impl fmt::Display for ToUnixtimeFunction {
#[cfg(test)]
mod tests {
use common_query::prelude::TypeSignature;
use datatypes::prelude::ConcreteDataType;
use datatypes::types::StringType;
use datatypes::prelude::{ConcreteDataType, ScalarVectorBuilder};
use datatypes::scalars::ScalarVector;
use datatypes::timestamp::TimestampSecond;
use datatypes::value::Value;
use datatypes::vectors::StringVector;
use datatypes::vectors::{StringVector, TimestampSecondVector};
use super::{ToUnixtimeFunction, *};
use crate::scalars::Function;
#[test]
fn test_to_unixtime() {
fn test_string_to_unixtime() {
let f = ToUnixtimeFunction::default();
assert_eq!("to_unixtime", f.name());
assert_eq!(
ConcreteDataType::timestamp_second_datatype(),
ConcreteDataType::int64_datatype(),
f.return_type(&[]).unwrap()
);
assert!(matches!(f.signature(),
Signature {
type_signature: TypeSignature::Exact(valid_types),
volatility: Volatility::Immutable
} if valid_types == vec![ConcreteDataType::String(StringType)]
Signature {
type_signature: TypeSignature::Uniform(1, valid_types),
volatility: Volatility::Immutable
} if valid_types == vec![
ConcreteDataType::string_datatype(),
ConcreteDataType::int32_datatype(),
ConcreteDataType::int64_datatype(),
ConcreteDataType::timestamp_second_datatype(),
ConcreteDataType::timestamp_millisecond_datatype(),
ConcreteDataType::timestamp_microsecond_datatype(),
ConcreteDataType::timestamp_nanosecond_datatype(),
]
));
let times = vec![
@@ -145,4 +208,106 @@ mod tests {
}
}
}
#[test]
fn test_int_to_unixtime() {
let f = ToUnixtimeFunction::default();
assert_eq!("to_unixtime", f.name());
assert_eq!(
ConcreteDataType::int64_datatype(),
f.return_type(&[]).unwrap()
);
assert!(matches!(f.signature(),
Signature {
type_signature: TypeSignature::Uniform(1, valid_types),
volatility: Volatility::Immutable
} if valid_types == vec![
ConcreteDataType::string_datatype(),
ConcreteDataType::int32_datatype(),
ConcreteDataType::int64_datatype(),
ConcreteDataType::timestamp_second_datatype(),
ConcreteDataType::timestamp_millisecond_datatype(),
ConcreteDataType::timestamp_microsecond_datatype(),
ConcreteDataType::timestamp_nanosecond_datatype(),
]
));
let times = vec![Some(3_i64), None, Some(5_i64), None];
let results = vec![Some(3), None, Some(5), None];
let args: Vec<VectorRef> = vec![Arc::new(Int64Vector::from(times.clone()))];
let vector = f.eval(FunctionContext::default(), &args).unwrap();
assert_eq!(4, vector.len());
for (i, _t) in times.iter().enumerate() {
let v = vector.get(i);
if i == 1 || i == 3 {
assert_eq!(Value::Null, v);
continue;
}
match v {
Value::Int64(ts) => {
assert_eq!(ts, (*results.get(i).unwrap()).unwrap());
}
_ => unreachable!(),
}
}
}
#[test]
fn test_timestamp_to_unixtime() {
let f = ToUnixtimeFunction::default();
assert_eq!("to_unixtime", f.name());
assert_eq!(
ConcreteDataType::int64_datatype(),
f.return_type(&[]).unwrap()
);
assert!(matches!(f.signature(),
Signature {
type_signature: TypeSignature::Uniform(1, valid_types),
volatility: Volatility::Immutable
} if valid_types == vec![
ConcreteDataType::string_datatype(),
ConcreteDataType::int32_datatype(),
ConcreteDataType::int64_datatype(),
ConcreteDataType::timestamp_second_datatype(),
ConcreteDataType::timestamp_millisecond_datatype(),
ConcreteDataType::timestamp_microsecond_datatype(),
ConcreteDataType::timestamp_nanosecond_datatype(),
]
));
let times: Vec<Option<TimestampSecond>> = vec![
Some(TimestampSecond::new(123)),
None,
Some(TimestampSecond::new(42)),
None,
];
let results = vec![Some(123), None, Some(42), None];
let ts_vector: TimestampSecondVector = build_vector_from_slice(&times);
let args: Vec<VectorRef> = vec![Arc::new(ts_vector)];
let vector = f.eval(FunctionContext::default(), &args).unwrap();
assert_eq!(4, vector.len());
for (i, _t) in times.iter().enumerate() {
let v = vector.get(i);
if i == 1 || i == 3 {
assert_eq!(Value::Null, v);
continue;
}
match v {
Value::Int64(ts) => {
assert_eq!(ts, (*results.get(i).unwrap()).unwrap());
}
_ => unreachable!(),
}
}
}
fn build_vector_from_slice<T: ScalarVector>(items: &[Option<T::RefItem<'_>>]) -> T {
let mut builder = T::Builder::with_capacity(items.len());
for item in items {
builder.push(*item);
}
builder.finish()
}
}

View File

@@ -34,7 +34,7 @@ const LOCATION_TYPE_FIRST: i32 = LocationType::First as i32;
const LOCATION_TYPE_AFTER: i32 = LocationType::After as i32;
/// Convert an [`AlterExpr`] to an [`AlterTableRequest`]
pub fn alter_expr_to_request(expr: AlterExpr) -> Result<AlterTableRequest> {
pub fn alter_expr_to_request(table_id: TableId, expr: AlterExpr) -> Result<AlterTableRequest> {
let catalog_name = expr.catalog_name;
let schema_name = expr.schema_name;
let kind = expr.kind.context(MissingFieldSnafu { field: "kind" })?;
@@ -69,6 +69,7 @@ pub fn alter_expr_to_request(expr: AlterExpr) -> Result<AlterTableRequest> {
catalog_name,
schema_name,
table_name: expr.table_name,
table_id,
alter_kind,
};
Ok(request)
@@ -82,6 +83,7 @@ pub fn alter_expr_to_request(expr: AlterExpr) -> Result<AlterTableRequest> {
catalog_name,
schema_name,
table_name: expr.table_name,
table_id,
alter_kind,
};
Ok(request)
@@ -92,6 +94,7 @@ pub fn alter_expr_to_request(expr: AlterExpr) -> Result<AlterTableRequest> {
catalog_name,
schema_name,
table_name: expr.table_name,
table_id,
alter_kind,
};
Ok(request)
@@ -239,7 +242,7 @@ mod tests {
})),
};
let alter_request = alter_expr_to_request(expr).unwrap();
let alter_request = alter_expr_to_request(1, expr).unwrap();
assert_eq!(alter_request.catalog_name, "");
assert_eq!(alter_request.schema_name, "");
assert_eq!("monitor".to_string(), alter_request.table_name);
@@ -296,7 +299,7 @@ mod tests {
})),
};
let alter_request = alter_expr_to_request(expr).unwrap();
let alter_request = alter_expr_to_request(1, expr).unwrap();
assert_eq!(alter_request.catalog_name, "");
assert_eq!(alter_request.schema_name, "");
assert_eq!("monitor".to_string(), alter_request.table_name);
@@ -344,7 +347,7 @@ mod tests {
})),
};
let alter_request = alter_expr_to_request(expr).unwrap();
let alter_request = alter_expr_to_request(1, expr).unwrap();
assert_eq!(alter_request.catalog_name, "test_catalog");
assert_eq!(alter_request.schema_name, "test_schema");
assert_eq!("monitor".to_string(), alter_request.table_name);

View File

@@ -6,6 +6,7 @@ license.workspace = true
[dependencies]
api = { path = "../../api" }
async-trait.workspace = true
common-catalog = { path = "../catalog" }
common-error = { path = "../error" }
common-runtime = { path = "../runtime" }

View File

@@ -52,6 +52,9 @@ pub enum Error {
err_msg: String,
location: Location,
},
#[snafu(display("Invalid protobuf message, err: {}", err_msg))]
InvalidProtoMsg { err_msg: String, location: Location },
}
pub type Result<T> = std::result::Result<T, Error>;
@@ -61,7 +64,10 @@ impl ErrorExt for Error {
use Error::*;
match self {
IllegalServerState { .. } => StatusCode::Internal,
SerdeJson { .. } | RouteInfoCorrupted { .. } => StatusCode::Unexpected,
SerdeJson { .. } | RouteInfoCorrupted { .. } | InvalidProtoMsg { .. } => {
StatusCode::Unexpected
}
SendMessage { .. } => StatusCode::Internal,

View File

@@ -15,6 +15,7 @@
use std::sync::Arc;
use api::v1::meta::HeartbeatResponse;
use async_trait::async_trait;
use common_telemetry::error;
use crate::error::Result;
@@ -57,14 +58,16 @@ impl HeartbeatResponseHandlerContext {
/// [`HeartbeatResponseHandler::is_acceptable`] returns true if handler can handle incoming [`HeartbeatResponseHandlerContext`].
///
/// [`HeartbeatResponseHandler::handle`] handles all or part of incoming [`HeartbeatResponseHandlerContext`].
#[async_trait]
pub trait HeartbeatResponseHandler: Send + Sync {
fn is_acceptable(&self, ctx: &HeartbeatResponseHandlerContext) -> bool;
fn handle(&self, ctx: &mut HeartbeatResponseHandlerContext) -> Result<HandleControl>;
async fn handle(&self, ctx: &mut HeartbeatResponseHandlerContext) -> Result<HandleControl>;
}
#[async_trait]
pub trait HeartbeatResponseHandlerExecutor: Send + Sync {
fn handle(&self, ctx: HeartbeatResponseHandlerContext) -> Result<()>;
async fn handle(&self, ctx: HeartbeatResponseHandlerContext) -> Result<()>;
}
pub struct HandlerGroupExecutor {
@@ -77,14 +80,15 @@ impl HandlerGroupExecutor {
}
}
#[async_trait]
impl HeartbeatResponseHandlerExecutor for HandlerGroupExecutor {
fn handle(&self, mut ctx: HeartbeatResponseHandlerContext) -> Result<()> {
async fn handle(&self, mut ctx: HeartbeatResponseHandlerContext) -> Result<()> {
for handler in &self.handlers {
if !handler.is_acceptable(&ctx) {
continue;
}
match handler.handle(&mut ctx) {
match handler.handle(&mut ctx).await {
Ok(HandleControl::Done) => break,
Ok(HandleControl::Continue) => {}
Err(e) => {
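
Note on the executor loop above: the control flow (skip handlers that do not accept the context, stop at the first `Done`) can be sketched synchronously with the standard library alone. `Ctx`, `Handler`, `run_chain`, `Parse`, and `Consume` are hypothetical names, and returning the error to the caller is an assumption of this sketch, since the hunk is cut off before the `Err` arm is shown.

```rust
/// Mirrors the two control values used in the diff.
#[derive(Debug, PartialEq, Eq)]
pub enum HandleControl {
    Continue,
    Done,
}

/// Hypothetical context: a payload plus a trace of which handlers ran.
#[derive(Debug, Default)]
pub struct Ctx {
    pub payload: Option<String>,
    pub trace: Vec<&'static str>,
}

/// Synchronous stand-in for `HeartbeatResponseHandler`.
pub trait Handler {
    fn is_acceptable(&self, ctx: &Ctx) -> bool;
    fn handle(&self, ctx: &mut Ctx) -> Result<HandleControl, String>;
}

/// Runs handlers in order, skipping those that do not accept the context
/// and stopping at the first `Done`. Errors are returned to the caller
/// (an assumption of this sketch).
pub fn run_chain(handlers: &[Box<dyn Handler>], mut ctx: Ctx) -> Result<Ctx, String> {
    for handler in handlers {
        if !handler.is_acceptable(&ctx) {
            continue;
        }
        match handler.handle(&mut ctx)? {
            HandleControl::Done => break,
            HandleControl::Continue => {}
        }
    }
    Ok(ctx)
}

/// Accepts everything and lets the chain continue.
pub struct Parse;

impl Handler for Parse {
    fn is_acceptable(&self, _ctx: &Ctx) -> bool {
        true
    }

    fn handle(&self, ctx: &mut Ctx) -> Result<HandleControl, String> {
        ctx.trace.push("parse");
        Ok(HandleControl::Continue)
    }
}

/// Accepts only contexts with a payload and ends the chain.
pub struct Consume;

impl Handler for Consume {
    fn is_acceptable(&self, ctx: &Ctx) -> bool {
        ctx.payload.is_some()
    }

    fn handle(&self, ctx: &mut Ctx) -> Result<HandleControl, String> {
        ctx.trace.push("consume");
        Ok(HandleControl::Done)
    }
}
```

In the patch itself the only change to this flow is that `handle` becomes `async` and the executor awaits each handler in turn. The ordering and early-exit behavior sketched here stay the same.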

View File

@@ -12,6 +12,8 @@
// See the License for the specific language governing permissions and
// limitations under the License.
use async_trait::async_trait;
use crate::error::Result;
use crate::heartbeat::handler::{
HandleControl, HeartbeatResponseHandler, HeartbeatResponseHandlerContext,
@@ -21,12 +23,13 @@ use crate::heartbeat::utils::mailbox_message_to_incoming_message;
#[derive(Default)]
pub struct ParseMailboxMessageHandler;
#[async_trait]
impl HeartbeatResponseHandler for ParseMailboxMessageHandler {
fn is_acceptable(&self, _ctx: &HeartbeatResponseHandlerContext) -> bool {
true
}
fn handle(&self, ctx: &mut HeartbeatResponseHandlerContext) -> Result<HandleControl> {
async fn handle(&self, ctx: &mut HeartbeatResponseHandlerContext) -> Result<HandleControl> {
if let Some(message) = &ctx.response.mailbox_message {
if message.payload.is_some() {
// mailbox_message_to_incoming_message will raise an error if payload is none

View File

@@ -0,0 +1,71 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
use std::fmt::{Display, Formatter};
use api::v1::meta::{TableIdent as RawTableIdent, TableName};
use serde::{Deserialize, Serialize};
use snafu::OptionExt;
use crate::error::{Error, InvalidProtoMsgSnafu};
#[derive(Eq, Hash, PartialEq, Clone, Debug, Serialize, Deserialize)]
pub struct TableIdent {
pub catalog: String,
pub schema: String,
pub table: String,
pub table_id: u32,
pub engine: String,
}
impl Display for TableIdent {
fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
write!(
f,
"Table(id={}, name='{}.{}.{}', engine='{}')",
self.table_id, self.catalog, self.schema, self.table, self.engine,
)
}
}
impl TryFrom<RawTableIdent> for TableIdent {
type Error = Error;
fn try_from(value: RawTableIdent) -> Result<Self, Self::Error> {
let table_name = value.table_name.context(InvalidProtoMsgSnafu {
err_msg: "'table_name' is missing in TableIdent",
})?;
Ok(Self {
catalog: table_name.catalog_name,
schema: table_name.schema_name,
table: table_name.table_name,
table_id: value.table_id,
engine: value.engine,
})
}
}
impl From<TableIdent> for RawTableIdent {
fn from(table_ident: TableIdent) -> Self {
Self {
table_id: table_ident.table_id,
engine: table_ident.engine,
table_name: Some(TableName {
catalog_name: table_ident.catalog,
schema_name: table_ident.schema,
table_name: table_ident.table,
}),
}
}
}
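Review note: the new file pairs a fallible `TryFrom` (the protobuf message may lack its nested `table_name`) with an infallible `From` in the other direction. The sketch below shows that round trip with the standard library only. `RawTableName` and `RawTableIdent` are hypothetical stand-ins for the generated protobuf types, and a plain `String` replaces the crate's snafu-based error.

```rust
use std::convert::TryFrom;
use std::fmt;

/// Hypothetical stand-in for the protobuf `TableName` message.
#[derive(Clone, Debug, PartialEq)]
pub struct RawTableName {
    pub catalog_name: String,
    pub schema_name: String,
    pub table_name: String,
}

/// Hypothetical stand-in for the protobuf `TableIdent`; the nested message is optional.
#[derive(Clone, Debug, PartialEq)]
pub struct RawTableIdent {
    pub table_id: u32,
    pub engine: String,
    pub table_name: Option<RawTableName>,
}

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct TableIdent {
    pub catalog: String,
    pub schema: String,
    pub table: String,
    pub table_id: u32,
    pub engine: String,
}

impl fmt::Display for TableIdent {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "Table(id={}, name='{}.{}.{}', engine='{}')",
            self.table_id, self.catalog, self.schema, self.table, self.engine,
        )
    }
}

impl TryFrom<RawTableIdent> for TableIdent {
    // Simplification: the real code uses a snafu error type.
    type Error = String;

    fn try_from(value: RawTableIdent) -> Result<Self, Self::Error> {
        // The nested message is optional on the wire, so its absence is an error here.
        let name = value
            .table_name
            .ok_or_else(|| "'table_name' is missing in TableIdent".to_string())?;
        Ok(Self {
            catalog: name.catalog_name,
            schema: name.schema_name,
            table: name.table_name,
            table_id: value.table_id,
            engine: value.engine,
        })
    }
}

impl From<TableIdent> for RawTableIdent {
    fn from(ident: TableIdent) -> Self {
        Self {
            table_id: ident.table_id,
            engine: ident.engine,
            table_name: Some(RawTableName {
                catalog_name: ident.catalog,
                schema_name: ident.schema,
                table_name: ident.table,
            }),
        }
    }
}
```

Converting to the raw form and back should return an equal value, while a raw message without `table_name` is rejected.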


@@ -16,6 +16,7 @@ use std::fmt::{Display, Formatter};
use serde::{Deserialize, Serialize};
use crate::ident::TableIdent;
use crate::{ClusterId, DatanodeId};
#[derive(Eq, Hash, PartialEq, Clone, Debug, Serialize, Deserialize)]
@@ -49,25 +50,6 @@ impl From<RegionIdent> for TableIdent {
}
}
#[derive(Eq, Hash, PartialEq, Clone, Debug, Serialize, Deserialize)]
pub struct TableIdent {
pub catalog: String,
pub schema: String,
pub table: String,
pub table_id: u32,
pub engine: String,
}
impl Display for TableIdent {
fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
write!(
f,
"TableIdent(table_id='{}', table_name='{}.{}.{}', table_engine='{}')",
self.table_id, self.catalog, self.schema, self.table, self.engine,
)
}
}
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone)]
pub struct SimpleReply {
pub result: bool,


@@ -14,6 +14,7 @@
pub mod error;
pub mod heartbeat;
pub mod ident;
pub mod instruction;
pub mod key;
pub mod peer;


@@ -20,6 +20,7 @@ mod udf;
use std::sync::Arc;
use datatypes::prelude::ConcreteDataType;
pub use expr::build_filter_from_timestamp;
pub use self::accumulator::{Accumulator, AggregateFunctionCreator, AggregateFunctionCreatorRef};
pub use self::expr::{DfExpr, Expr};
@@ -28,7 +29,6 @@ pub use self::udf::ScalarUdf;
use crate::function::{ReturnTypeFunction, ScalarFunctionImplementation};
use crate::logical_plan::accumulator::*;
use crate::signature::{Signature, Volatility};
/// Creates a new UDF with a specific signature and specific return type.
/// This is a helper function to create a new UDF.
/// The function `create_udf` returns a subset of all possible `ScalarFunction`:


@@ -12,7 +12,12 @@
// See the License for the specific language governing permissions and
// limitations under the License.
use common_time::range::TimestampRange;
use common_time::timestamp::TimeUnit;
use common_time::Timestamp;
use datafusion_common::{Column, ScalarValue};
pub use datafusion_expr::expr::Expr as DfExpr;
use datafusion_expr::{and, binary_expr, Operator};
/// Central struct of query API.
/// Represent logical expressions such as `A + 1`, or `CAST(c1 AS int)`.
@@ -33,6 +38,54 @@ impl From<DfExpr> for Expr {
}
}
/// Builds an `Expr` that filters timestamp column from given timestamp range.
/// Returns [None] if the time range is [None] or covers the full time range.
pub fn build_filter_from_timestamp(
ts_col_name: &str,
time_range: Option<&TimestampRange>,
) -> Option<Expr> {
let Some(time_range) = time_range else { return None; };
let ts_col_expr = DfExpr::Column(Column {
relation: None,
name: ts_col_name.to_string(),
});
let df_expr = match (time_range.start(), time_range.end()) {
(None, None) => None,
(Some(start), None) => Some(binary_expr(
ts_col_expr,
Operator::GtEq,
timestamp_to_literal(start),
)),
(None, Some(end)) => Some(binary_expr(
ts_col_expr,
Operator::Lt,
timestamp_to_literal(end),
)),
(Some(start), Some(end)) => Some(and(
binary_expr(
ts_col_expr.clone(),
Operator::GtEq,
timestamp_to_literal(start),
),
binary_expr(ts_col_expr, Operator::Lt, timestamp_to_literal(end)),
)),
};
df_expr.map(Expr::from)
}
/// Converts a [Timestamp] to datafusion literal value.
fn timestamp_to_literal(timestamp: &Timestamp) -> DfExpr {
let scalar_value = match timestamp.unit() {
TimeUnit::Second => ScalarValue::TimestampSecond(Some(timestamp.value()), None),
TimeUnit::Millisecond => ScalarValue::TimestampMillisecond(Some(timestamp.value()), None),
TimeUnit::Microsecond => ScalarValue::TimestampMicrosecond(Some(timestamp.value()), None),
TimeUnit::Nanosecond => ScalarValue::TimestampNanosecond(Some(timestamp.value()), None),
};
DfExpr::Literal(scalar_value)
}
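Review note: `build_filter_from_timestamp` maps a half-open range `[start, end)` to at most two comparisons, `>=` on the start and `<` on the end, and yields nothing when both bounds are open. The sketch below reproduces only that case analysis, rendering the predicate as a string instead of a DataFusion expression. `TimeRange` and `build_filter` are hypothetical names, and raw `i64` values stand in for typed timestamps.

```rust
/// Hypothetical, simplified range: `None` means unbounded on that side.
pub struct TimeRange {
    pub start: Option<i64>,
    pub end: Option<i64>,
}

/// Builds a textual predicate for the half-open range `[start, end)`.
/// Returns `None` when no range is given or the range is unbounded on both sides.
pub fn build_filter(ts_col: &str, range: Option<&TimeRange>) -> Option<String> {
    let range = range?;
    match (range.start, range.end) {
        (None, None) => None,
        (Some(start), None) => Some(format!("{} >= {}", ts_col, start)),
        (None, Some(end)) => Some(format!("{} < {}", ts_col, end)),
        (Some(start), Some(end)) => Some(format!(
            "{} >= {} AND {} < {}",
            ts_col, start, ts_col, end
        )),
    }
}
```

The end bound is exclusive in every branch, so adjacent ranges do not overlap.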
#[cfg(test)]
mod tests {
use super::*;


@@ -22,6 +22,7 @@ use datafusion::arrow::datatypes::SchemaRef as DfSchemaRef;
use datafusion::error::Result as DfResult;
pub use datafusion::execution::context::{SessionContext, TaskContext};
use datafusion::physical_plan::expressions::PhysicalSortExpr;
use datafusion::physical_plan::metrics::{BaselineMetrics, ExecutionPlanMetricsSet, MetricsSet};
pub use datafusion::physical_plan::Partitioning;
use datafusion::physical_plan::Statistics;
use datatypes::schema::SchemaRef;
@@ -69,6 +70,10 @@ pub trait PhysicalPlan: Debug + Send + Sync {
partition: usize,
context: Arc<TaskContext>,
) -> Result<SendableRecordBatchStream>;
fn metrics(&self) -> Option<MetricsSet> {
None
}
}
/// Adapt DataFusion's [`ExecutionPlan`](DfPhysicalPlan) to GreptimeDB's [`PhysicalPlan`].
@@ -76,11 +81,16 @@ pub trait PhysicalPlan: Debug + Send + Sync {
pub struct PhysicalPlanAdapter {
schema: SchemaRef,
df_plan: Arc<dyn DfPhysicalPlan>,
metric: ExecutionPlanMetricsSet,
}
impl PhysicalPlanAdapter {
pub fn new(schema: SchemaRef, df_plan: Arc<dyn DfPhysicalPlan>) -> Self {
Self { schema, df_plan }
Self {
schema,
df_plan,
metric: ExecutionPlanMetricsSet::new(),
}
}
pub fn df_plan(&self) -> Arc<dyn DfPhysicalPlan> {
@@ -127,15 +137,21 @@ impl PhysicalPlan for PhysicalPlanAdapter {
partition: usize,
context: Arc<TaskContext>,
) -> Result<SendableRecordBatchStream> {
let baseline_metric = BaselineMetrics::new(&self.metric, partition);
let df_plan = self.df_plan.clone();
let stream = df_plan
.execute(partition, context)
.context(error::GeneralDataFusionSnafu)?;
let adapter = RecordBatchStreamAdapter::try_new(stream)
let adapter = RecordBatchStreamAdapter::try_new_with_metrics(stream, baseline_metric)
.context(error::ConvertDfRecordBatchStreamSnafu)?;
Ok(Box::pin(adapter))
}
fn metrics(&self) -> Option<MetricsSet> {
Some(self.metric.clone_inner())
}
}
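Review note: the adapter hands a per-partition metric to the stream and later reads the totals back through `metrics()`. This relies on metric handles sharing their counter when cloned and on a guard that records elapsed time when it goes out of scope. The sketch below shows that pattern with the standard library only. `ElapsedTime` and `TimerGuard` are hypothetical names and not the DataFusion types.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::{Duration, Instant};

/// Hypothetical elapsed-time metric: clones share one counter.
#[derive(Clone, Default)]
pub struct ElapsedTime {
    nanos: Arc<AtomicU64>,
}

impl ElapsedTime {
    /// Starts a timer; the elapsed time is added when the guard is dropped.
    pub fn timer(&self) -> TimerGuard {
        TimerGuard {
            start: Instant::now(),
            target: self.clone(),
        }
    }

    /// Adds a duration to the shared counter.
    pub fn add(&self, duration: Duration) {
        self.nanos
            .fetch_add(duration.as_nanos() as u64, Ordering::Relaxed);
    }

    /// Returns the total recorded so far.
    pub fn value(&self) -> Duration {
        Duration::from_nanos(self.nanos.load(Ordering::Relaxed))
    }
}

/// Records the time between its creation and its drop.
pub struct TimerGuard {
    start: Instant,
    target: ElapsedTime,
}

impl Drop for TimerGuard {
    fn drop(&mut self) {
        self.target.add(self.start.elapsed());
    }
}
```

Binding the guard to a named variable such as `_guard` keeps it alive until the end of the scope; binding it to `_` would drop it immediately and record almost nothing.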
#[derive(Debug)]
@@ -196,6 +212,10 @@ impl DfPhysicalPlan for DfPhysicalPlanAdapter {
fn statistics(&self) -> Statistics {
Statistics::default()
}
fn metrics(&self) -> Option<MetricsSet> {
self.0.metrics()
}
}
#[cfg(test)]


@@ -20,6 +20,7 @@ use std::task::{Context, Poll};
use datafusion::arrow::datatypes::SchemaRef as DfSchemaRef;
use datafusion::error::Result as DfResult;
use datafusion::parquet::arrow::async_reader::{AsyncFileReader, ParquetRecordBatchStream};
use datafusion::physical_plan::metrics::BaselineMetrics;
use datafusion::physical_plan::RecordBatchStream as DfRecordBatchStream;
use datafusion_common::DataFusionError;
use datatypes::schema::{Schema, SchemaRef};
@@ -115,13 +116,31 @@ impl Stream for DfRecordBatchStreamAdapter {
pub struct RecordBatchStreamAdapter {
schema: SchemaRef,
stream: DfSendableRecordBatchStream,
metrics: Option<BaselineMetrics>,
}
impl RecordBatchStreamAdapter {
pub fn try_new(stream: DfSendableRecordBatchStream) -> Result<Self> {
let schema =
Arc::new(Schema::try_from(stream.schema()).context(error::SchemaConversionSnafu)?);
Ok(Self { schema, stream })
Ok(Self {
schema,
stream,
metrics: None,
})
}
pub fn try_new_with_metrics(
stream: DfSendableRecordBatchStream,
metrics: BaselineMetrics,
) -> Result<Self> {
let schema =
Arc::new(Schema::try_from(stream.schema()).context(error::SchemaConversionSnafu)?);
Ok(Self {
schema,
stream,
metrics: Some(metrics),
})
}
}
@@ -135,6 +154,12 @@ impl Stream for RecordBatchStreamAdapter {
type Item = Result<RecordBatch>;
fn poll_next(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
let timer = self
.metrics
.as_ref()
.map(|m| m.elapsed_compute().clone())
.unwrap_or_default();
let _guard = timer.timer();
match Pin::new(&mut self.stream).poll_next(cx) {
Poll::Pending => Poll::Pending,
Poll::Ready(Some(df_record_batch)) => {


@@ -52,6 +52,12 @@ impl From<i32> for Date {
}
}
impl From<NaiveDate> for Date {
fn from(date: NaiveDate) -> Self {
Self(date.num_days_from_ce() - UNIX_EPOCH_FROM_CE)
}
}
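Review note: the conversion above subtracts the day number of 1970-01-01 from a day count that starts at 0001-01-01 as day 1, giving days since the Unix epoch. The sketch below reproduces the arithmetic without chrono. `num_days_from_ce` and `date_from_ymd` here are hypothetical helpers that handle only years from 1 onward and do not validate their input, and the constant value 719163 is my understanding of the day number of 1970-01-01 under that convention.

```rust
/// Day number of 1970-01-01 when 0001-01-01 is day 1
/// (proleptic Gregorian calendar). Assumed to match the crate's constant.
const UNIX_EPOCH_FROM_CE: i32 = 719_163;

fn is_leap(year: i32) -> bool {
    (year % 4 == 0 && year % 100 != 0) || year % 400 == 0
}

/// Hypothetical helper: day number counted from 0001-01-01 as day 1.
/// Only valid for `year >= 1`; month and day are not validated.
pub fn num_days_from_ce(year: i32, month: u32, day: u32) -> i32 {
    // Days before the first of each month in a non-leap year.
    const CUMULATIVE: [i32; 12] = [0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334];
    let y = year - 1;
    // Days in all complete years before `year`, including leap days.
    let before_year = 365 * y + y / 4 - y / 100 + y / 400;
    let mut day_of_year = CUMULATIVE[(month - 1) as usize] + day as i32;
    if month > 2 && is_leap(year) {
        day_of_year += 1;
    }
    before_year + day_of_year
}

/// Days since the Unix epoch; negative before 1970-01-01.
#[derive(Debug, PartialEq)]
pub struct Date(pub i32);

pub fn date_from_ymd(year: i32, month: u32, day: u32) -> Date {
    Date(num_days_from_ce(year, month, day) - UNIX_EPOCH_FROM_CE)
}
```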
impl Display for Date {
/// [Date] is formatted according to ISO-8601 standard.
fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {


@@ -14,6 +14,8 @@
use std::fmt::{Debug, Display, Formatter};
use serde::{Deserialize, Serialize};
use crate::timestamp::TimeUnit;
use crate::timestamp_millis::TimestampMillis;
use crate::Timestamp;
@@ -23,7 +25,7 @@ use crate::Timestamp;
/// The range contains values such that `value >= start` and `value < end`.
///
/// The range is empty iff `start == end == "the default value of T"`
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, Serialize, Deserialize)]
pub struct GenericRange<T> {
start: Option<T>,
end: Option<T>,
@@ -522,4 +524,25 @@ mod tests {
);
assert!(range.is_empty());
}
#[test]
fn test_serialize_timestamp_range() {
macro_rules! test_serde_for_unit {
($($unit: expr),*) => {
$(
let original_range = TimestampRange::with_unit(0, 10, $unit).unwrap();
let string = serde_json::to_string(&original_range).unwrap();
let deserialized: TimestampRange = serde_json::from_str(&string).unwrap();
assert_eq!(original_range, deserialized);
)*
};
}
test_serde_for_unit!(
TimeUnit::Second,
TimeUnit::Millisecond,
TimeUnit::Microsecond,
TimeUnit::Nanosecond
);
}
}


@@ -192,8 +192,8 @@ impl Default for WalConfig {
fn default() -> Self {
Self {
dir: None,
file_size: ReadableSize::gb(1), // log file size 1G
purge_threshold: ReadableSize::gb(50), // purge threshold 50G
file_size: ReadableSize::mb(256), // log file size 256MB
purge_threshold: ReadableSize::gb(4), // purge threshold 4GB
purge_interval: Duration::from_secs(600),
read_batch_size: 128,
sync_write: false,


@@ -17,24 +17,27 @@ use std::sync::Arc;
use std::time::Duration;
use api::v1::meta::{HeartbeatRequest, NodeStat, Peer};
use catalog::remote::region_alive_keeper::RegionAliveKeepers;
use catalog::{datanode_stat, CatalogManagerRef};
use common_meta::heartbeat::handler::{
HeartbeatResponseHandlerContext, HeartbeatResponseHandlerExecutorRef,
};
use common_meta::heartbeat::mailbox::{HeartbeatMailbox, MailboxRef};
use common_meta::heartbeat::utils::outgoing_message_to_mailbox_message;
use common_telemetry::{error, info, trace, warn};
use common_telemetry::{debug, error, info, trace, warn};
use meta_client::client::{HeartbeatSender, MetaClient};
use snafu::ResultExt;
use tokio::sync::mpsc;
use tokio::time::Instant;
use crate::datanode::DatanodeOptions;
use crate::error::{self, MetaClientInitSnafu, Result};
pub(crate) mod handler;
pub struct HeartbeatTask {
node_id: u64,
node_epoch: u64,
server_addr: String,
server_hostname: Option<String>,
running: Arc<AtomicBool>,
@@ -42,6 +45,7 @@ pub struct HeartbeatTask {
catalog_manager: CatalogManagerRef,
interval: u64,
resp_handler_executor: HeartbeatResponseHandlerExecutorRef,
region_alive_keepers: Arc<RegionAliveKeepers>,
}
impl Drop for HeartbeatTask {
@@ -54,21 +58,25 @@ impl HeartbeatTask {
/// Create a new heartbeat task instance.
pub fn new(
node_id: u64,
server_addr: String,
server_hostname: Option<String>,
opts: &DatanodeOptions,
meta_client: Arc<MetaClient>,
catalog_manager: CatalogManagerRef,
resp_handler_executor: HeartbeatResponseHandlerExecutorRef,
heartbeat_interval_millis: u64,
region_alive_keepers: Arc<RegionAliveKeepers>,
) -> Self {
Self {
node_id,
server_addr,
server_hostname,
// We use datanode's start time millis as the node's epoch.
node_epoch: common_time::util::current_time_millis() as u64,
server_addr: opts.rpc_addr.clone(),
server_hostname: opts.rpc_hostname.clone(),
running: Arc::new(AtomicBool::new(false)),
meta_client,
catalog_manager,
interval: 5_000, // default interval is set to 5 secs
interval: heartbeat_interval_millis,
resp_handler_executor,
region_alive_keepers,
}
}
@@ -94,7 +102,7 @@ impl HeartbeatTask {
}
let ctx = HeartbeatResponseHandlerContext::new(mailbox.clone(), res);
if let Err(e) = Self::handle_response(ctx, handler_executor.clone()) {
if let Err(e) = Self::handle_response(ctx, handler_executor.clone()).await {
error!(e; "Error while handling heartbeat response");
}
if !running.load(Ordering::Acquire) {
@@ -106,13 +114,14 @@ impl HeartbeatTask {
Ok(tx)
}
fn handle_response(
async fn handle_response(
ctx: HeartbeatResponseHandlerContext,
handler_executor: HeartbeatResponseHandlerExecutorRef,
) -> Result<()> {
trace!("heartbeat response: {:?}", ctx.response);
handler_executor
.handle(ctx)
.await
.context(error::HandleHeartbeatResponseSnafu)
}
@@ -128,9 +137,12 @@ impl HeartbeatTask {
}
let interval = self.interval;
let node_id = self.node_id;
let node_epoch = self.node_epoch;
let addr = resolve_addr(&self.server_addr, &self.server_hostname);
info!("Starting heartbeat to Metasrv with interval {interval}. My node id is {node_id}, address is {addr}.");
self.region_alive_keepers.start().await;
let meta_client = self.meta_client.clone();
let catalog_manager_clone = self.catalog_manager.clone();
@@ -147,6 +159,7 @@ impl HeartbeatTask {
)
.await?;
let epoch = self.region_alive_keepers.epoch();
common_runtime::spawn_bg(async move {
let sleep = tokio::time::sleep(Duration::from_millis(0));
tokio::pin!(sleep);
@@ -192,6 +205,8 @@ impl HeartbeatTask {
..Default::default()
}),
region_stats,
duration_since_epoch: (Instant::now() - epoch).as_millis() as u64,
node_epoch,
..Default::default()
};
sleep.as_mut().reset(Instant::now() + Duration::from_millis(interval));
@@ -199,6 +214,7 @@ impl HeartbeatTask {
}
};
if let Some(req) = req {
debug!("Sending heartbeat request: {:?}", req);
if let Err(e) = tx.send(req).await {
error!("Failed to send heartbeat to metasrv, error: {:?}", e);
match Self::create_streams(


@@ -14,15 +14,16 @@
use std::sync::Arc;
use async_trait::async_trait;
use catalog::remote::region_alive_keeper::RegionAliveKeepers;
use catalog::{CatalogManagerRef, DeregisterTableRequest};
use common_catalog::format_full_table_name;
use common_meta::error::Result as MetaResult;
use common_meta::heartbeat::handler::{
HandleControl, HeartbeatResponseHandler, HeartbeatResponseHandlerContext,
};
use common_meta::instruction::{
Instruction, InstructionReply, RegionIdent, SimpleReply, TableIdent,
};
use common_meta::instruction::{Instruction, InstructionReply, SimpleReply};
use common_meta::RegionIdent;
use common_telemetry::{error, info, warn};
use snafu::ResultExt;
use store_api::storage::RegionNumber;
@@ -36,8 +37,10 @@ use crate::error::{self, Result};
pub struct CloseRegionHandler {
catalog_manager: CatalogManagerRef,
table_engine_manager: TableEngineManagerRef,
region_alive_keepers: Arc<RegionAliveKeepers>,
}
#[async_trait]
impl HeartbeatResponseHandler for CloseRegionHandler {
fn is_acceptable(&self, ctx: &HeartbeatResponseHandlerContext) -> bool {
matches!(
@@ -46,35 +49,15 @@ impl HeartbeatResponseHandler for CloseRegionHandler {
)
}
fn handle(&self, ctx: &mut HeartbeatResponseHandlerContext) -> MetaResult<HandleControl> {
async fn handle(&self, ctx: &mut HeartbeatResponseHandlerContext) -> MetaResult<HandleControl> {
let Some((meta, Instruction::CloseRegion(region_ident))) = ctx.incoming_message.take() else {
unreachable!("CloseRegionHandler: should be guarded by 'is_acceptable'");
};
let mailbox = ctx.mailbox.clone();
let self_ref = Arc::new(self.clone());
let RegionIdent {
table_ident:
TableIdent {
engine,
catalog,
schema,
table,
..
},
region_number,
..
} = region_ident;
common_runtime::spawn_bg(async move {
let result = self_ref
.close_region_inner(
engine,
&TableReference::full(&catalog, &schema, &table),
vec![region_number],
)
.await;
let result = self_ref.close_region_inner(region_ident).await;
if let Err(e) = mailbox
.send((meta, CloseRegionHandler::map_result(result)))
@@ -92,10 +75,12 @@ impl CloseRegionHandler {
pub fn new(
catalog_manager: CatalogManagerRef,
table_engine_manager: TableEngineManagerRef,
region_alive_keepers: Arc<RegionAliveKeepers>,
) -> Self {
Self {
catalog_manager,
table_engine_manager,
region_alive_keepers,
}
}
@@ -151,20 +136,21 @@ impl CloseRegionHandler {
Ok(true)
}
async fn close_region_inner(
&self,
engine: String,
table_ref: &TableReference<'_>,
region_numbers: Vec<RegionNumber>,
) -> Result<bool> {
let engine =
self.table_engine_manager
.engine(&engine)
.context(error::TableEngineNotFoundSnafu {
engine_name: &engine,
})?;
async fn close_region_inner(&self, region_ident: RegionIdent) -> Result<bool> {
let table_ident = &region_ident.table_ident;
let engine_name = &table_ident.engine;
let engine = self
.table_engine_manager
.engine(engine_name)
.context(error::TableEngineNotFoundSnafu { engine_name })?;
let ctx = EngineContext::default();
let table_ref = &TableReference::full(
&table_ident.catalog,
&table_ident.schema,
&table_ident.table,
);
let region_numbers = vec![region_ident.region_number];
if self
.regions_closed(
table_ref.catalog,
@@ -178,7 +164,7 @@ impl CloseRegionHandler {
}
if engine
.get_table(&ctx, table_ref)
.get_table(&ctx, region_ident.table_ident.table_id)
.with_context(|_| error::GetTableSnafu {
table_name: table_ref.to_string(),
})?
@@ -192,6 +178,7 @@ impl CloseRegionHandler {
schema_name: table_ref.schema.to_string(),
table_name: table_ref.table.to_string(),
region_numbers: region_numbers.clone(),
table_id: region_ident.table_ident.table_id,
flush: true,
},
)
@@ -202,7 +189,15 @@ impl CloseRegionHandler {
})? {
CloseTableResult::NotFound | CloseTableResult::Released(_) => {
// Deregister the table if it has been released.
self.deregister_table(table_ref).await
let deregistered = self.deregister_table(table_ref).await?;
if deregistered {
self.region_alive_keepers
.deregister_table(table_ident)
.await;
}
Ok(deregistered)
}
CloseTableResult::PartialClosed(regions) => {
// Requires caller to update the region_numbers
@@ -210,6 +205,11 @@ impl CloseRegionHandler {
"Close partial regions: {:?} in table: {}",
regions, table_ref
);
self.region_alive_keepers
.deregister_region(&region_ident)
.await;
Ok(true)
}
};


@@ -14,16 +14,16 @@
use std::sync::Arc;
use async_trait::async_trait;
use catalog::error::Error as CatalogError;
use catalog::remote::region_alive_keeper::RegionAliveKeepers;
use catalog::{CatalogManagerRef, RegisterTableRequest};
use common_catalog::format_full_table_name;
use common_meta::error::Result as MetaResult;
use common_meta::heartbeat::handler::{
HandleControl, HeartbeatResponseHandler, HeartbeatResponseHandlerContext,
};
use common_meta::instruction::{
Instruction, InstructionReply, RegionIdent, SimpleReply, TableIdent,
};
use common_meta::instruction::{Instruction, InstructionReply, SimpleReply};
use common_telemetry::{error, warn};
use snafu::ResultExt;
use store_api::storage::RegionNumber;
@@ -37,8 +37,10 @@ use crate::error::{self, Result};
pub struct OpenRegionHandler {
catalog_manager: CatalogManagerRef,
table_engine_manager: TableEngineManagerRef,
region_alive_keepers: Arc<RegionAliveKeepers>,
}
#[async_trait]
impl HeartbeatResponseHandler for OpenRegionHandler {
fn is_acceptable(&self, ctx: &HeartbeatResponseHandlerContext) -> bool {
matches!(
@@ -47,7 +49,7 @@ impl HeartbeatResponseHandler for OpenRegionHandler {
)
}
fn handle(&self, ctx: &mut HeartbeatResponseHandlerContext) -> MetaResult<HandleControl> {
async fn handle(&self, ctx: &mut HeartbeatResponseHandlerContext) -> MetaResult<HandleControl> {
let Some((meta, Instruction::OpenRegion(region_ident))) = ctx.incoming_message.take() else {
unreachable!("OpenRegionHandler: should be guarded by 'is_acceptable'");
};
@@ -55,9 +57,24 @@ impl HeartbeatResponseHandler for OpenRegionHandler {
let mailbox = ctx.mailbox.clone();
let self_ref = Arc::new(self.clone());
let region_alive_keepers = self.region_alive_keepers.clone();
common_runtime::spawn_bg(async move {
let (engine, request) = OpenRegionHandler::prepare_request(region_ident);
let result = self_ref.open_region_inner(engine, request).await;
let table_ident = &region_ident.table_ident;
let request = OpenTableRequest {
catalog_name: table_ident.catalog.clone(),
schema_name: table_ident.schema.clone(),
table_name: table_ident.table.clone(),
table_id: table_ident.table_id,
region_numbers: vec![region_ident.region_number],
};
let result = self_ref
.open_region_inner(table_ident.engine.clone(), request)
.await;
if matches!(result, Ok(true)) {
region_alive_keepers.register_region(&region_ident).await;
}
if let Err(e) = mailbox
.send((meta, OpenRegionHandler::map_result(result)))
.await
@@ -73,10 +90,12 @@ impl OpenRegionHandler {
pub fn new(
catalog_manager: CatalogManagerRef,
table_engine_manager: TableEngineManagerRef,
region_alive_keepers: Arc<RegionAliveKeepers>,
) -> Self {
Self {
catalog_manager,
table_engine_manager,
region_alive_keepers,
}
}
@@ -97,32 +116,6 @@ impl OpenRegionHandler {
)
}
fn prepare_request(ident: RegionIdent) -> (String, OpenTableRequest) {
let RegionIdent {
table_ident:
TableIdent {
catalog,
schema,
table,
table_id,
engine,
},
region_number,
..
} = ident;
(
engine,
OpenTableRequest {
catalog_name: catalog,
schema_name: schema,
table_name: table,
table_id,
region_numbers: vec![region_number],
},
)
}
/// Returns true if a table or target regions have been opened.
async fn regions_opened(
&self,


@@ -18,7 +18,8 @@ use std::time::Duration;
use std::{fs, path};
use api::v1::meta::Role;
use catalog::remote::CachedMetaKvBackend;
use catalog::remote::region_alive_keeper::RegionAliveKeepers;
use catalog::remote::{CachedMetaKvBackend, RemoteCatalogManager};
use catalog::{CatalogManager, CatalogManagerRef, RegisterTableRequest};
use common_base::paths::{CLUSTER_DIR, WAL_DIR};
use common_catalog::consts::{DEFAULT_CATALOG_NAME, DEFAULT_SCHEMA_NAME, MIN_USER_TABLE_ID};
@@ -56,9 +57,9 @@ use table::Table;
use crate::datanode::{DatanodeOptions, ObjectStoreConfig, ProcedureConfig, WalConfig};
use crate::error::{
self, CatalogSnafu, MetaClientInitSnafu, MissingMetasrvOptsSnafu, MissingNodeIdSnafu,
NewCatalogSnafu, OpenLogStoreSnafu, RecoverProcedureSnafu, Result, ShutdownInstanceSnafu,
StartProcedureManagerSnafu, StopProcedureManagerSnafu,
self, CatalogSnafu, IncorrectInternalStateSnafu, MetaClientInitSnafu, MissingMetasrvOptsSnafu,
MissingNodeIdSnafu, NewCatalogSnafu, OpenLogStoreSnafu, RecoverProcedureSnafu, Result,
ShutdownInstanceSnafu, StartProcedureManagerSnafu, StopProcedureManagerSnafu,
};
use crate::heartbeat::handler::close_region::CloseRegionHandler;
use crate::heartbeat::handler::open_region::OpenRegionHandler;
@@ -150,7 +151,7 @@ impl Instance {
);
// create remote catalog manager
let (catalog_manager, table_id_provider) = match opts.mode {
let (catalog_manager, table_id_provider, heartbeat_task) = match opts.mode {
Mode::Standalone => {
if opts.enable_memory_catalog {
let catalog = Arc::new(catalog::local::MemoryCatalogManager::default());
@@ -170,6 +171,7 @@ impl Instance {
(
catalog.clone() as CatalogManagerRef,
Some(catalog as TableIdProviderRef),
None,
)
} else {
let catalog = Arc::new(
@@ -181,51 +183,64 @@ impl Instance {
(
catalog.clone() as CatalogManagerRef,
Some(catalog as TableIdProviderRef),
None,
)
}
}
Mode::Distributed => {
let kv_backend = Arc::new(CachedMetaKvBackend::new(
meta_client.as_ref().unwrap().clone(),
let meta_client = meta_client.context(IncorrectInternalStateSnafu {
state: "meta client is not provided when creating distributed Datanode",
})?;
let kv_backend = Arc::new(CachedMetaKvBackend::new(meta_client.clone()));
let heartbeat_interval_millis = 5000;
let region_alive_keepers = Arc::new(RegionAliveKeepers::new(
engine_manager.clone(),
heartbeat_interval_millis,
));
let catalog = Arc::new(catalog::remote::RemoteCatalogManager::new(
let catalog_manager = Arc::new(RemoteCatalogManager::new(
engine_manager.clone(),
opts.node_id.context(MissingNodeIdSnafu)?,
kv_backend,
region_alive_keepers.clone(),
));
(catalog as CatalogManagerRef, None)
let handlers_executor = HandlerGroupExecutor::new(vec![
Arc::new(ParseMailboxMessageHandler::default()),
Arc::new(OpenRegionHandler::new(
catalog_manager.clone(),
engine_manager.clone(),
region_alive_keepers.clone(),
)),
Arc::new(CloseRegionHandler::new(
catalog_manager.clone(),
engine_manager.clone(),
region_alive_keepers.clone(),
)),
region_alive_keepers.clone(),
]);
let heartbeat_task = Some(HeartbeatTask::new(
opts.node_id.context(MissingNodeIdSnafu)?,
opts,
meta_client,
catalog_manager.clone(),
Arc::new(handlers_executor),
heartbeat_interval_millis,
region_alive_keepers,
));
(catalog_manager as CatalogManagerRef, None, heartbeat_task)
}
};
let factory = QueryEngineFactory::new(catalog_manager.clone(), false);
let query_engine = factory.query_engine();
let handlers_executor = HandlerGroupExecutor::new(vec![
Arc::new(ParseMailboxMessageHandler::default()),
Arc::new(OpenRegionHandler::new(
catalog_manager.clone(),
engine_manager.clone(),
)),
Arc::new(CloseRegionHandler::new(
catalog_manager.clone(),
engine_manager.clone(),
)),
]);
let heartbeat_task = match opts.mode {
Mode::Standalone => None,
Mode::Distributed => Some(HeartbeatTask::new(
opts.node_id.context(MissingNodeIdSnafu)?,
opts.rpc_addr.clone(),
opts.rpc_hostname.clone(),
meta_client.as_ref().unwrap().clone(),
catalog_manager.clone(),
Arc::new(handlers_executor),
)),
};
let procedure_manager =
create_procedure_manager(opts.node_id.unwrap_or(0), &opts.procedure, object_store)
.await?;
@@ -354,7 +369,7 @@ impl Instance {
fn create_compaction_scheduler<S: LogStore>(opts: &DatanodeOptions) -> CompactionSchedulerRef<S> {
let picker = SimplePicker::default();
let config = SchedulerConfig::from(opts);
let handler = CompactionHandler::new(picker);
let handler = CompactionHandler { picker };
let scheduler = LocalScheduler::new(config, handler);
Arc::new(scheduler)
}


@@ -106,7 +106,13 @@ impl Instance {
let name = alter_table.table_name().clone();
let (catalog, schema, table) = table_idents_to_full_name(&name, query_ctx.clone())?;
let table_ref = TableReference::full(&catalog, &schema, &table);
let req = SqlHandler::alter_to_request(alter_table, table_ref)?;
// Currently, we have to get the table multiple times. Consider removing the sql handler in the future.
let table = self.sql_handler.get_table(&table_ref).await?;
let req = SqlHandler::alter_to_request(
alter_table,
table_ref,
table.table_info().ident.table_id,
)?;
self.sql_handler
.execute(SqlRequest::Alter(req), query_ctx)
.await
@@ -114,10 +120,13 @@ impl Instance {
Statement::DropTable(drop_table) => {
let (catalog_name, schema_name, table_name) =
table_idents_to_full_name(drop_table.table_name(), query_ctx.clone())?;
let table_ref = TableReference::full(&catalog_name, &schema_name, &table_name);
let table = self.sql_handler.get_table(&table_ref).await?;
let req = DropTableRequest {
catalog_name,
schema_name,
table_name,
table_id: table.table_info().ident.table_id,
};
self.sql_handler
.execute(SqlRequest::DropTable(req), query_ctx)
@@ -228,6 +237,22 @@ pub fn table_idents_to_full_name(
}
}
pub fn idents_to_full_database_name(
obj_name: &ObjectName,
query_ctx: &QueryContextRef,
) -> Result<(String, String)> {
match &obj_name.0[..] {
[database] => Ok((query_ctx.current_catalog(), database.value.clone())),
[catalog, database] => Ok((catalog.value.clone(), database.value.clone())),
_ => error::InvalidSqlSnafu {
msg: format!(
"expect database name to be <catalog>.<database>, <database>, found: {obj_name}",
),
}
.fail(),
}
}
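Review note: the function above resolves a one-part or two-part name with a slice pattern and falls back to the session's current catalog when only the database is given. The sketch below isolates that matching with the standard library only. `resolve_database_name` is a hypothetical name, and plain string slices stand in for the SQL parser's `ObjectName` and for the query context.

```rust
/// Hypothetical stand-in: resolves `<database>` or `<catalog>.<database>`
/// into a `(catalog, database)` pair, using `current_catalog` as the default.
pub fn resolve_database_name(
    parts: &[&str],
    current_catalog: &str,
) -> Result<(String, String), String> {
    match parts {
        // Only the database is given: use the session's current catalog.
        [database] => Ok((current_catalog.to_string(), database.to_string())),
        // Fully qualified name.
        [catalog, database] => Ok((catalog.to_string(), database.to_string())),
        // Empty names and names with more than two parts are rejected.
        _ => Err(format!(
            "expect database name to be <catalog>.<database> or <database>, found: {}",
            parts.join(".")
        )),
    }
}
```

Slice patterns make the accepted shapes explicit, so an empty name or a three-part name reaches the error branch without extra length checks.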
#[async_trait]
impl SqlStatementExecutor for Instance {
async fn execute_sql(


@@ -14,6 +14,7 @@
use api::v1::{AlterExpr, CreateTableExpr, DropTableExpr, FlushTableExpr};
use common_catalog::consts::IMMUTABLE_FILE_ENGINE;
use common_catalog::format_full_table_name;
use common_grpc_expr::{alter_expr_to_request, create_expr_to_request};
use common_query::Output;
use common_telemetry::info;
@@ -22,8 +23,8 @@ use snafu::prelude::*;
use table::requests::{DropTableRequest, FlushTableRequest};
use crate::error::{
AlterExprToRequestSnafu, BumpTableIdSnafu, CreateExprToRequestSnafu,
IncorrectInternalStateSnafu, Result,
AlterExprToRequestSnafu, BumpTableIdSnafu, CatalogSnafu, CreateExprToRequestSnafu,
IncorrectInternalStateSnafu, Result, TableNotFoundSnafu,
};
use crate::instance::Instance;
use crate::sql::SqlRequest;
@@ -69,17 +70,45 @@ impl Instance {
}
pub(crate) async fn handle_alter(&self, expr: AlterExpr) -> Result<Output> {
let request = alter_expr_to_request(expr).context(AlterExprToRequestSnafu)?;
let table = self
.catalog_manager
.table(&expr.catalog_name, &expr.schema_name, &expr.table_name)
.await
.context(CatalogSnafu)?
.with_context(|| TableNotFoundSnafu {
table_name: format_full_table_name(
&expr.catalog_name,
&expr.schema_name,
&expr.table_name,
),
})?;
let request = alter_expr_to_request(table.table_info().ident.table_id, expr)
.context(AlterExprToRequestSnafu)?;
self.sql_handler()
.execute(SqlRequest::Alter(request), QueryContext::arc())
.await
}
pub(crate) async fn handle_drop_table(&self, expr: DropTableExpr) -> Result<Output> {
let table = self
.catalog_manager
.table(&expr.catalog_name, &expr.schema_name, &expr.table_name)
.await
.context(CatalogSnafu)?
.with_context(|| TableNotFoundSnafu {
table_name: format_full_table_name(
&expr.catalog_name,
&expr.schema_name,
&expr.table_name,
),
})?;
let req = DropTableRequest {
catalog_name: expr.catalog_name,
schema_name: expr.schema_name,
table_name: expr.table_name,
table_id: table.table_info().ident.table_id,
};
self.sql_handler()
.execute(SqlRequest::DropTable(req), QueryContext::arc())


@@ -19,6 +19,7 @@ use snafu::prelude::*;
use sql::statements::alter::{AlterTable, AlterTableOperation};
use sql::statements::column_def_to_schema;
use table::engine::TableReference;
use table::metadata::TableId;
use table::requests::{AddColumnRequest, AlterKind, AlterTableRequest};
use table_procedure::AlterTableProcedure;
@@ -60,6 +61,7 @@ impl SqlHandler {
pub(crate) fn alter_to_request(
alter_table: AlterTable,
table_ref: TableReference,
table_id: TableId,
) -> Result<AlterTableRequest> {
let alter_kind = match &alter_table.alter_operation() {
AlterTableOperation::AddConstraint(table_constraint) => {
@@ -91,6 +93,7 @@ impl SqlHandler {
catalog_name: table_ref.catalog.to_string(),
schema_name: table_ref.schema.to_string(),
table_name: table_ref.table.to_string(),
table_id,
alter_kind,
})
}
@@ -128,6 +131,7 @@ mod tests {
let req = SqlHandler::alter_to_request(
alter_table,
TableReference::full("greptime", "public", "my_metric_1"),
1,
)
.unwrap();
assert_eq!(req.catalog_name, "greptime");
@@ -154,6 +158,7 @@ mod tests {
let req = SqlHandler::alter_to_request(
alter_table,
TableReference::full("greptime", "public", "test_table"),
1,
)
.unwrap();
assert_eq!(req.catalog_name, "greptime");


@@ -14,26 +14,29 @@
use std::assert_matches::assert_matches;
use std::sync::Arc;
use std::time::Duration;
use api::v1::greptime_request::Request as GrpcRequest;
use api::v1::meta::HeartbeatResponse;
use api::v1::query_request::Query;
use api::v1::QueryRequest;
use catalog::remote::region_alive_keeper::RegionAliveKeepers;
use catalog::CatalogManagerRef;
use common_meta::heartbeat::handler::{
HandlerGroupExecutor, HeartbeatResponseHandlerContext, HeartbeatResponseHandlerExecutor,
};
use common_meta::heartbeat::mailbox::{HeartbeatMailbox, MessageMeta};
use common_meta::instruction::{
Instruction, InstructionReply, RegionIdent, SimpleReply, TableIdent,
};
use common_meta::ident::TableIdent;
use common_meta::instruction::{Instruction, InstructionReply, RegionIdent, SimpleReply};
use common_query::Output;
use datatypes::prelude::ConcreteDataType;
use servers::query_handler::grpc::GrpcQueryHandler;
use session::context::QueryContext;
use table::engine::manager::TableEngineManagerRef;
use table::TableRef;
use test_util::MockInstance;
use tokio::sync::mpsc::{self, Receiver};
use tokio::time::Instant;
use crate::heartbeat::handler::close_region::CloseRegionHandler;
use crate::heartbeat::handler::open_region::OpenRegionHandler;
@@ -61,7 +64,11 @@ async fn test_close_region_handler() {
} = prepare_handler_test("test_close_region_handler").await;
let executor = Arc::new(HandlerGroupExecutor::new(vec![Arc::new(
CloseRegionHandler::new(catalog_manager_ref.clone(), engine_manager_ref.clone()),
CloseRegionHandler::new(
catalog_manager_ref.clone(),
engine_manager_ref.clone(),
Arc::new(RegionAliveKeepers::new(engine_manager_ref.clone(), 5000)),
),
)]));
prepare_table(instance.inner()).await;
@@ -71,7 +78,8 @@ async fn test_close_region_handler() {
executor.clone(),
mailbox.clone(),
close_region_instruction(),
);
)
.await;
let (_, reply) = rx.recv().await.unwrap();
assert_matches!(
reply,
@@ -85,7 +93,8 @@ async fn test_close_region_handler() {
executor.clone(),
mailbox.clone(),
close_region_instruction(),
);
)
.await;
let (_, reply) = rx.recv().await.unwrap();
assert_matches!(
reply,
@@ -108,7 +117,8 @@ async fn test_close_region_handler() {
cluster_id: 1,
datanode_id: 2,
}),
);
)
.await;
let (_, reply) = rx.recv().await.unwrap();
assert_matches!(
reply,
@@ -127,56 +137,81 @@ async fn test_open_region_handler() {
..
} = prepare_handler_test("test_open_region_handler").await;
let region_alive_keepers = Arc::new(RegionAliveKeepers::new(engine_manager_ref.clone(), 5000));
region_alive_keepers.start().await;
let executor = Arc::new(HandlerGroupExecutor::new(vec![
Arc::new(OpenRegionHandler::new(
catalog_manager_ref.clone(),
engine_manager_ref.clone(),
region_alive_keepers.clone(),
)),
Arc::new(CloseRegionHandler::new(
catalog_manager_ref.clone(),
engine_manager_ref.clone(),
region_alive_keepers.clone(),
)),
]));
prepare_table(instance.inner()).await;
let instruction = open_region_instruction();
let Instruction::OpenRegion(region_ident) = instruction.clone() else { unreachable!() };
let table_ident = &region_ident.table_ident;
let table = prepare_table(instance.inner()).await;
region_alive_keepers
.register_table(table_ident.clone(), table)
.await
.unwrap();
// Opens an already opened table
handle_instruction(executor.clone(), mailbox.clone(), open_region_instruction());
handle_instruction(executor.clone(), mailbox.clone(), instruction.clone()).await;
let (_, reply) = rx.recv().await.unwrap();
assert_matches!(
reply,
InstructionReply::OpenRegion(SimpleReply { result: true, .. })
);
let keeper = region_alive_keepers.find_keeper(table_ident).await.unwrap();
let deadline = keeper.deadline(0).await.unwrap();
assert!(deadline <= Instant::now() + Duration::from_secs(20));
// Opens a non-existent table
let non_exist_table_ident = TableIdent {
catalog: "greptime".to_string(),
schema: "public".to_string(),
table: "non-exist".to_string(),
table_id: 2024,
engine: "mito".to_string(),
};
handle_instruction(
executor.clone(),
mailbox.clone(),
Instruction::OpenRegion(RegionIdent {
table_ident: TableIdent {
catalog: "greptime".to_string(),
schema: "public".to_string(),
table: "non-exist".to_string(),
table_id: 2024,
engine: "mito".to_string(),
},
table_ident: non_exist_table_ident.clone(),
region_number: 0,
cluster_id: 1,
datanode_id: 2,
}),
);
)
.await;
let (_, reply) = rx.recv().await.unwrap();
assert_matches!(
reply,
InstructionReply::OpenRegion(SimpleReply { result: false, .. })
);
assert!(region_alive_keepers
.find_keeper(&non_exist_table_ident)
.await
.is_none());
// Closes demo table
handle_instruction(
executor.clone(),
mailbox.clone(),
close_region_instruction(),
);
)
.await;
let (_, reply) = rx.recv().await.unwrap();
assert_matches!(
reply,
@@ -184,8 +219,13 @@ async fn test_open_region_handler() {
);
assert_test_table_not_found(instance.inner()).await;
assert!(region_alive_keepers
.find_keeper(table_ident)
.await
.is_none());
// Opens demo table
handle_instruction(executor.clone(), mailbox.clone(), open_region_instruction());
handle_instruction(executor.clone(), mailbox.clone(), instruction).await;
let (_, reply) = rx.recv().await.unwrap();
assert_matches!(
reply,
@@ -220,7 +260,7 @@ pub fn test_message_meta(id: u64, subject: &str, to: &str, from: &str) -> Messag
}
}
fn handle_instruction(
async fn handle_instruction(
executor: Arc<dyn HeartbeatResponseHandlerExecutor>,
mailbox: Arc<HeartbeatMailbox>,
instruction: Instruction,
@@ -229,7 +269,7 @@ fn handle_instruction(
let mut ctx: HeartbeatResponseHandlerContext =
HeartbeatResponseHandlerContext::new(mailbox, response);
ctx.incoming_message = Some((test_message_meta(1, "hi", "foo", "bar"), instruction));
executor.handle(ctx).unwrap();
executor.handle(ctx).await.unwrap();
}
fn close_region_instruction() -> Instruction {
@@ -262,10 +302,10 @@ fn open_region_instruction() -> Instruction {
})
}
async fn prepare_table(instance: &Instance) {
async fn prepare_table(instance: &Instance) -> TableRef {
test_util::create_test_table(instance, ConcreteDataType::timestamp_millisecond_datatype())
.await
.unwrap();
.unwrap()
}
async fn assert_test_table_not_found(instance: &Instance) {

View File

@@ -22,6 +22,7 @@ use servers::Mode;
use snafu::ResultExt;
use table::engine::{EngineContext, TableEngineRef};
use table::requests::{CreateTableRequest, TableOptions};
use table::TableRef;
use crate::datanode::{
DatanodeOptions, FileConfig, ObjectStoreConfig, ProcedureConfig, StorageConfig, WalConfig,
@@ -84,7 +85,7 @@ fn create_tmp_dir_and_datanode_opts(name: &str) -> (DatanodeOptions, TestGuard)
pub(crate) async fn create_test_table(
instance: &Instance,
ts_type: ConcreteDataType,
) -> Result<()> {
) -> Result<TableRef> {
let column_schemas = vec![
ColumnSchema::new("host", ConcreteDataType::string_datatype(), true),
ColumnSchema::new("cpu", ConcreteDataType::float64_datatype(), true),
@@ -125,8 +126,8 @@ pub(crate) async fn create_test_table(
.unwrap()
.unwrap();
schema_provider
.register_table(table_name.to_string(), table)
.register_table(table_name.to_string(), table.clone())
.await
.unwrap();
Ok(())
Ok(table)
}

View File

@@ -183,6 +183,12 @@ impl ConcreteDataType {
}
}
impl From<&ConcreteDataType> for ConcreteDataType {
fn from(t: &ConcreteDataType) -> Self {
t.clone()
}
}
impl TryFrom<&ArrowDataType> for ConcreteDataType {
type Error = Error;
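
The `From<&ConcreteDataType> for ConcreteDataType` impl added above is presumably there so that generic APIs taking `impl Into<ConcreteDataType>` can accept a borrowed value as well as an owned one. A minimal, self-contained sketch of that pattern follows; `DataType`, `Column`, and `new_column` are hypothetical stand-ins for illustration and are not taken from the repository.

```rust
// Hypothetical stand-in for a data type enum such as `ConcreteDataType`.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    String,
}

// Same shape as the impl in the diff: converting from a reference clones.
impl From<&DataType> for DataType {
    fn from(t: &DataType) -> Self {
        t.clone()
    }
}

// Hypothetical consumer of the conversion.
#[derive(Debug, PartialEq)]
struct Column {
    name: String,
    data_type: DataType,
}

// Accepts an owned `DataType` (through the standard blanket
// `impl From<T> for T`) or a `&DataType` (through the impl above).
fn new_column(name: &str, data_type: impl Into<DataType>) -> Column {
    Column {
        name: name.to_string(),
        data_type: data_type.into(),
    }
}
```

With this in place, a caller can write `new_column("a", &shared)` and keep using `shared` afterwards, or pass an owned value such as `new_column("b", DataType::String)`.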

View File

@@ -248,7 +248,7 @@ impl Value {
Value::Binary(v) => ScalarValue::LargeBinary(Some(v.to_vec())),
Value::Date(v) => ScalarValue::Date32(Some(v.val())),
Value::DateTime(v) => ScalarValue::Date64(Some(v.val())),
Value::Null => to_null_value(output_type),
Value::Null => to_null_scalar_value(output_type),
Value::List(list) => {
// Safety: The logical type of the value and output_type are the same.
let list_type = output_type.as_list().unwrap();
@@ -261,7 +261,7 @@ impl Value {
}
}
fn to_null_value(output_type: &ConcreteDataType) -> ScalarValue {
pub fn to_null_scalar_value(output_type: &ConcreteDataType) -> ScalarValue {
match output_type {
ConcreteDataType::Null(_) => ScalarValue::Null,
ConcreteDataType::Boolean(_) => ScalarValue::Boolean(None),
@@ -285,7 +285,7 @@ fn to_null_value(output_type: &ConcreteDataType) -> ScalarValue {
}
ConcreteDataType::Dictionary(dict) => ScalarValue::Dictionary(
Box::new(dict.key_type().as_arrow_type()),
Box::new(to_null_value(dict.value_type())),
Box::new(to_null_scalar_value(dict.value_type())),
),
}
}

View File

@@ -25,7 +25,7 @@ use object_store::ObjectStore;
use snafu::ResultExt;
use table::engine::{table_dir, EngineContext, TableEngine, TableEngineProcedure, TableReference};
use table::error::TableOperationSnafu;
use table::metadata::{TableInfo, TableInfoBuilder, TableMetaBuilder, TableType};
use table::metadata::{TableId, TableInfo, TableInfoBuilder, TableMetaBuilder, TableType};
use table::requests::{AlterTableRequest, CreateTableRequest, DropTableRequest, OpenTableRequest};
use table::{error as table_error, Result as TableResult, Table, TableRef};
use tokio::sync::Mutex;
@@ -88,16 +88,12 @@ impl TableEngine for ImmutableFileTableEngine {
.fail()
}
fn get_table(
&self,
_ctx: &EngineContext,
table_ref: &TableReference,
) -> TableResult<Option<TableRef>> {
Ok(self.inner.get_table(table_ref))
fn get_table(&self, _ctx: &EngineContext, table_id: TableId) -> TableResult<Option<TableRef>> {
Ok(self.inner.get_table(table_id))
}
fn table_exists(&self, _ctx: &EngineContext, table_ref: &TableReference) -> bool {
self.inner.get_table(table_ref).is_some()
fn table_exists(&self, _ctx: &EngineContext, table_id: TableId) -> bool {
self.inner.get_table(table_id).is_some()
}
async fn drop_table(
@@ -151,8 +147,8 @@ impl TableEngineProcedure for ImmutableFileTableEngine {
#[cfg(test)]
impl ImmutableFileTableEngine {
pub async fn close_table(&self, table_ref: &TableReference<'_>) -> TableResult<()> {
self.inner.close_table(table_ref).await
pub async fn close_table(&self, table_id: TableId) -> TableResult<()> {
self.inner.close_table(table_id).await
}
}
@@ -173,10 +169,10 @@ impl ImmutableFileTableEngine {
}
struct EngineInner {
/// All tables opened by the engine. Map key is formatted [TableReference].
/// All tables opened by the engine.
///
/// Writing to `tables` should also hold the `table_mutex`.
tables: RwLock<HashMap<String, ImmutableFileTableRef>>,
tables: RwLock<HashMap<TableId, ImmutableFileTableRef>>,
object_store: ObjectStore,
/// Table mutex is used to protect the operations such as creating/opening/closing
@@ -199,6 +195,7 @@ impl EngineInner {
request: CreateTableRequest,
) -> Result<TableRef> {
let CreateTableRequest {
id: table_id,
catalog_name,
schema_name,
table_name,
@@ -212,7 +209,7 @@ impl EngineInner {
table: &table_name,
};
if let Some(table) = self.get_table(&table_ref) {
if let Some(table) = self.get_table(table_id) {
return if create_if_not_exists {
Ok(table)
} else {
@@ -223,14 +220,13 @@ impl EngineInner {
let table_schema =
Arc::new(Schema::try_from(request.schema).context(InvalidRawSchemaSnafu)?);
let table_id = request.id;
let table_dir = table_dir(&catalog_name, &schema_name, table_id);
let table_full_name = table_ref.to_string();
let _lock = self.table_mutex.lock().await;
// Checks again, read lock should be enough since we are guarded by the mutex.
if let Some(table) = self.get_table_by_full_name(&table_full_name) {
if let Some(table) = self.get_table(table_id) {
return if request.create_if_not_exists {
Ok(table)
} else {
@@ -279,27 +275,20 @@ impl EngineInner {
table_id
);
self.tables
.write()
.unwrap()
.insert(table_full_name, table.clone());
self.tables.write().unwrap().insert(table_id, table.clone());
Ok(table)
}
fn get_table_by_full_name(&self, full_name: &str) -> Option<TableRef> {
fn get_table(&self, table_id: TableId) -> Option<TableRef> {
self.tables
.read()
.unwrap()
.get(full_name)
.get(&table_id)
.cloned()
.map(|table| table as _)
}
fn get_table(&self, table_ref: &TableReference) -> Option<TableRef> {
self.get_table_by_full_name(&table_ref.to_string())
}
async fn open_table(
&self,
_ctx: &EngineContext,
@@ -309,6 +298,7 @@ impl EngineInner {
catalog_name,
schema_name,
table_name,
table_id,
..
} = request;
let table_ref = TableReference {
@@ -317,16 +307,15 @@ impl EngineInner {
table: &table_name,
};
let table_full_name = table_ref.to_string();
if let Some(table) = self.get_table_by_full_name(&table_full_name) {
if let Some(table) = self.get_table(table_id) {
return Ok(Some(table));
}
let table_full_name = table_ref.to_string();
let table = {
let _lock = self.table_mutex.lock().await;
// Checks again, read lock should be enough since we are guarded by the mutex.
if let Some(table) = self.get_table_by_full_name(&table_full_name) {
if let Some(table) = self.get_table(table_id) {
return Ok(Some(table));
}
@@ -350,10 +339,7 @@ impl EngineInner {
.context(table_error::TableOperationSnafu)?,
);
self.tables
.write()
.unwrap()
.insert(table_full_name, table.clone());
self.tables.write().unwrap().insert(table_id, table.clone());
Some(table as _)
};
@@ -375,7 +361,7 @@ impl EngineInner {
let table_full_name = table_ref.to_string();
let _lock = self.table_mutex.lock().await;
if let Some(table) = self.get_table_by_full_name(&table_full_name) {
if let Some(table) = self.get_table(req.table_id) {
let table_id = table.table_info().ident.table_id;
let table_dir = table_dir(&req.catalog_name, &req.schema_name, table_id);
@@ -389,7 +375,7 @@ impl EngineInner {
.context(DropTableSnafu {
table_name: &table_full_name,
})?;
self.tables.write().unwrap().remove(&table_full_name);
self.tables.write().unwrap().remove(&req.table_id);
Ok(true)
} else {
@@ -429,12 +415,10 @@ impl EngineInner {
#[cfg(test)]
impl EngineInner {
pub async fn close_table(&self, table_ref: &TableReference<'_>) -> TableResult<()> {
let full_name = table_ref.to_string();
pub async fn close_table(&self, table_id: TableId) -> TableResult<()> {
let _lock = self.table_mutex.lock().await;
if let Some(table) = self.get_table_by_full_name(&full_name) {
if let Some(table) = self.get_table(table_id) {
let regions = Vec::new();
table
.close(&regions)
@@ -443,7 +427,7 @@ impl EngineInner {
.context(table_error::TableOperationSnafu)?;
}
self.tables.write().unwrap().remove(&full_name);
self.tables.write().unwrap().remove(&table_id);
Ok(())
}
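
The hunks above switch the engine's table map from a formatted table name to a `TableId` key, while keeping the "check, take the mutex, check again" sequence around inserts. The sketch below illustrates that pattern only. It assumes synchronous `std::sync` primitives instead of the async mutex used in the diff, and `Registry`, `Table`, and the `u32` alias for `TableId` are hypothetical simplifications.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, RwLock};

// Assumption: a plain integer alias stands in for the real `TableId`.
type TableId = u32;

#[derive(Debug)]
struct Table {
    id: TableId,
    name: String,
}

#[derive(Default)]
struct Registry {
    // All opened tables, keyed by table id rather than by a formatted name.
    tables: RwLock<HashMap<TableId, Arc<Table>>>,
    // Serializes create/open/close so that writers never race each other.
    table_mutex: Mutex<()>,
}

impl Registry {
    fn get_table(&self, table_id: TableId) -> Option<Arc<Table>> {
        self.tables.read().unwrap().get(&table_id).cloned()
    }

    fn open_table(&self, table_id: TableId, name: &str) -> Arc<Table> {
        // Fast path: no mutex needed if the table is already open.
        if let Some(table) = self.get_table(table_id) {
            return table;
        }
        let _lock = self.table_mutex.lock().unwrap();
        // Check again: another caller may have opened the table while we
        // waited. A read lock is enough because the mutex guards writers.
        if let Some(table) = self.get_table(table_id) {
            return table;
        }
        let table = Arc::new(Table {
            id: table_id,
            name: name.to_string(),
        });
        self.tables.write().unwrap().insert(table_id, table.clone());
        table
    }

    // Returns true if a table with this id was open and has been removed.
    fn close_table(&self, table_id: TableId) -> bool {
        let _lock = self.table_mutex.lock().unwrap();
        self.tables.write().unwrap().remove(&table_id).is_some()
    }
}
```

Keying by id means a rename does not have to touch the map, and lookups no longer need to allocate a formatted name string.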

View File

@@ -92,14 +92,13 @@ impl CreateImmutableFileTable {
fn on_prepare(&mut self) -> Result<Status> {
let engine_ctx = EngineContext::default();
let table_ref = self.data.table_ref();
// Safety: Current get_table implementation always returns Ok.
if self.engine.table_exists(&engine_ctx, &table_ref) {
if self.engine.table_exists(&engine_ctx, self.data.request.id) {
// The table already exists.
ensure!(
self.data.request.create_if_not_exists,
TableExistsSnafu {
table_name: table_ref.to_string(),
table_name: self.data.table_ref().to_string(),
}
);
@@ -113,8 +112,7 @@ impl CreateImmutableFileTable {
async fn on_create_table(&mut self) -> Result<Status> {
let engine_ctx = EngineContext::default();
let table_ref = self.data.table_ref();
if self.engine.table_exists(&engine_ctx, &table_ref) {
if self.engine.table_exists(&engine_ctx, self.data.request.id) {
// Table already created. We don't need to check create_if_not_exists as
// we have checked it in prepare state.
return Ok(Status::Done);

View File

@@ -16,7 +16,7 @@ use std::assert_matches::assert_matches;
use std::sync::Arc;
use common_catalog::consts::{DEFAULT_CATALOG_NAME, DEFAULT_SCHEMA_NAME, IMMUTABLE_FILE_ENGINE};
use table::engine::{EngineContext, TableEngine, TableEngineProcedure, TableReference};
use table::engine::{EngineContext, TableEngine, TableEngineProcedure};
use table::requests::{AlterKind, AlterTableRequest, DropTableRequest, OpenTableRequest};
use table::{error as table_error, Table};
@@ -35,14 +35,9 @@ async fn test_get_table() {
..
} = test_util::setup_test_engine_and_table("test_get_table").await;
let table_info = table.table_info();
let table_ref = TableReference {
catalog: &table_info.catalog_name,
schema: &table_info.schema_name,
table: &table_info.name,
};
let got = table_engine
.get_table(&EngineContext::default(), &table_ref)
.get_table(&EngineContext::default(), table_info.ident.table_id)
.unwrap()
.unwrap();
@@ -53,21 +48,17 @@ async fn test_get_table() {
async fn test_open_table() {
common_telemetry::init_default_ut_logging();
let ctx = EngineContext::default();
// the test table id is 1
let table_id = 1;
let open_req = OpenTableRequest {
catalog_name: DEFAULT_CATALOG_NAME.to_string(),
schema_name: DEFAULT_SCHEMA_NAME.to_string(),
table_name: test_util::TEST_TABLE_NAME.to_string(),
// the test table id is 1
table_id: 1,
table_id,
region_numbers: vec![0],
};
let table_ref = TableReference {
catalog: DEFAULT_CATALOG_NAME,
schema: DEFAULT_SCHEMA_NAME,
table: test_util::TEST_TABLE_NAME,
};
let TestEngineComponents {
table_engine,
table_ref: table,
@@ -77,7 +68,7 @@ async fn test_open_table() {
assert_eq!(IMMUTABLE_FILE_ENGINE, table_engine.name());
table_engine.close_table(&table_ref).await.unwrap();
table_engine.close_table(table_id).await.unwrap();
let reopened = table_engine
.open_table(&ctx, open_req.clone())
@@ -101,21 +92,17 @@ async fn test_open_table() {
async fn test_close_all_table() {
common_telemetry::init_default_ut_logging();
let table_ref = TableReference {
catalog: DEFAULT_CATALOG_NAME,
schema: DEFAULT_SCHEMA_NAME,
table: test_util::TEST_TABLE_NAME,
};
let TestEngineComponents {
table_engine,
dir: _dir,
table_ref: table,
..
} = test_util::setup_test_engine_and_table("test_close_all_table").await;
table_engine.close().await.unwrap();
let exist = table_engine.table_exists(&EngineContext::default(), &table_ref);
let table_id = table.table_info().ident.table_id;
let exist = table_engine.table_exists(&EngineContext::default(), table_id);
assert!(!exist);
}
@@ -126,6 +113,7 @@ async fn test_alter_table() {
let TestEngineComponents {
table_engine,
dir: _dir,
table_ref,
..
} = test_util::setup_test_engine_and_table("test_alter_table").await;
@@ -133,6 +121,7 @@ async fn test_alter_table() {
catalog_name: DEFAULT_CATALOG_NAME.to_string(),
schema_name: DEFAULT_SCHEMA_NAME.to_string(),
table_name: TEST_TABLE_NAME.to_string(),
table_id: table_ref.table_info().ident.table_id,
alter_kind: AlterKind::RenameTable {
new_table_name: "foo".to_string(),
},
@@ -151,12 +140,6 @@ async fn test_alter_table() {
async fn test_drop_table() {
common_telemetry::init_default_ut_logging();
let drop_req = DropTableRequest {
catalog_name: DEFAULT_CATALOG_NAME.to_string(),
schema_name: DEFAULT_SCHEMA_NAME.to_string(),
table_name: TEST_TABLE_NAME.to_string(),
};
let TestEngineComponents {
table_engine,
object_store,
@@ -167,12 +150,13 @@ async fn test_drop_table() {
} = test_util::setup_test_engine_and_table("test_drop_table").await;
let table_info = table.table_info();
let table_ref = TableReference {
catalog: &table_info.catalog_name,
schema: &table_info.schema_name,
table: &table_info.name,
};
let drop_req = DropTableRequest {
catalog_name: DEFAULT_CATALOG_NAME.to_string(),
schema_name: DEFAULT_SCHEMA_NAME.to_string(),
table_name: TEST_TABLE_NAME.to_string(),
table_id: table_info.ident.table_id,
};
let dropped = table_engine
.drop_table(&EngineContext::default(), drop_req)
.await
@@ -180,7 +164,7 @@ async fn test_drop_table() {
assert!(dropped);
let exist = table_engine.table_exists(&EngineContext::default(), &table_ref);
let exist = table_engine.table_exists(&EngineContext::default(), table_info.ident.table_id);
assert!(!exist);
// check table_dir manifest
@@ -203,13 +187,14 @@ async fn test_create_drop_table_procedure() {
let engine_ctx = EngineContext::default();
// Test create table by procedure.
let create_request = test_util::new_create_request(schema);
let table_id = create_request.id;
let mut procedure = table_engine
.create_table_procedure(&engine_ctx, create_request.clone())
.unwrap();
common_procedure_test::execute_procedure_until_done(&mut procedure).await;
assert!(table_engine
.get_table(&engine_ctx, &create_request.table_ref())
.get_table(&engine_ctx, table_id)
.unwrap()
.is_some());
@@ -218,6 +203,7 @@ async fn test_create_drop_table_procedure() {
catalog_name: DEFAULT_CATALOG_NAME.to_string(),
schema_name: DEFAULT_SCHEMA_NAME.to_string(),
table_name: TEST_TABLE_NAME.to_string(),
table_id,
};
let mut procedure = table_engine
.drop_table_procedure(&engine_ctx, drop_request)
@@ -225,7 +211,7 @@ async fn test_create_drop_table_procedure() {
common_procedure_test::execute_procedure_until_done(&mut procedure).await;
assert!(table_engine
.get_table(&engine_ctx, &create_request.table_ref())
.get_table(&engine_ctx, table_id)
.unwrap()
.is_none());
}

View File

@@ -14,6 +14,7 @@
use std::any::Any;
use common_datasource::file_format::Format;
use common_error::prelude::*;
use datafusion::arrow::error::ArrowError;
use datafusion::error::DataFusionError;
@@ -175,6 +176,9 @@ pub enum Error {
source: datatypes::error::Error,
location: Location,
},
#[snafu(display("Unsupported format: {:?}", format))]
UnsupportedFormat { format: Format, location: Location },
}
pub type Result<T> = std::result::Result<T, Error>;
@@ -191,7 +195,8 @@ impl ErrorExt for Error {
| BuildCsvConfig { .. }
| ProjectSchema { .. }
| MissingRequiredField { .. }
| ConvertSchema { .. } => StatusCode::InvalidArguments,
| ConvertSchema { .. }
| UnsupportedFormat { .. } => StatusCode::InvalidArguments,
BuildBackend { source, .. } => source.status_code(),
BuildStreamAdapter { source, .. } => source.status_code(),

View File

@@ -94,29 +94,8 @@ fn build_scan_plan<T: FileOpener + Send + 'static>(
projection: Option<&Vec<usize>>,
limit: Option<usize>,
) -> Result<PhysicalPlanRef> {
let stream = FileStream::new(
&FileScanConfig {
object_store_url: ObjectStoreUrl::parse("empty://").unwrap(), // won't be used
file_schema,
file_groups: vec![files
.iter()
.map(|filename| PartitionedFile::new(filename.to_string(), 0))
.collect::<Vec<_>>()],
statistics: Default::default(),
projection: projection.cloned(),
limit,
table_partition_cols: vec![],
output_ordering: None,
infinite_source: false,
},
0, // partition: hard-code
opener,
&ExecutionPlanMetricsSet::new(),
)
.context(error::BuildStreamSnafu)?;
let adapter = RecordBatchStreamAdapter::try_new(Box::pin(stream))
.context(error::BuildStreamAdapterSnafu)?;
Ok(Arc::new(StreamScanAdapter::new(Box::pin(adapter))))
let adapter = build_record_batch_stream(opener, file_schema, files, projection, limit)?;
Ok(Arc::new(StreamScanAdapter::new(adapter)))
}
fn build_record_batch_stream<T: FileOpener + Send + 'static>(
@@ -382,6 +361,7 @@ pub fn create_physical_plan(
Format::Csv(format) => new_csv_scan_plan(ctx, config, format),
Format::Json(format) => new_json_scan_plan(ctx, config, format),
Format::Parquet(format) => new_parquet_scan_plan(ctx, config, format),
_ => error::UnsupportedFormatSnafu { format: *format }.fail(),
}
}
@@ -394,5 +374,6 @@ pub fn create_stream(
Format::Csv(format) => new_csv_stream(ctx, config, format),
Format::Json(format) => new_json_stream(ctx, config, format),
Format::Parquet(format) => new_parquet_stream(ctx, config, format),
_ => error::UnsupportedFormatSnafu { format: *format }.fail(),
}
}
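
The catch-all arms added to `create_physical_plan` and `create_stream` turn any format without a scan implementation into an `UnsupportedFormat` error instead of leaving the match non-exhaustive. The sketch below shows the same shape with only the standard library. The `Format` and `Error` enums here are simplified, hypothetical versions: the real `Format` variants carry per-format options, and the real error is built with a snafu context selector.

```rust
use std::fmt;

// Simplified stand-in: the real variants carry format-specific options.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Format {
    Csv,
    Json,
    Parquet,
    Orc,
}

#[derive(Debug, PartialEq)]
enum Error {
    UnsupportedFormat { format: Format },
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // Same message shape as the display attribute in the diff.
            Error::UnsupportedFormat { format } => {
                write!(f, "Unsupported format: {:?}", format)
            }
        }
    }
}

// Returns a label instead of a real stream to keep the sketch small.
fn create_stream(format: &Format) -> Result<&'static str, Error> {
    match format {
        Format::Csv => Ok("csv stream"),
        Format::Json => Ok("json stream"),
        Format::Parquet => Ok("parquet stream"),
        // Any format without a stream implementation is reported to the
        // caller as an error rather than causing a panic.
        _ => Err(Error::UnsupportedFormat { format: *format }),
    }
}
```

One trade-off of the wildcard arm is that adding a new `Format` variant later will compile without touching this match, so the missing support only surfaces at runtime as this error.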

View File

@@ -29,6 +29,7 @@ common-meta = { path = "../common/meta" }
common-recordbatch = { path = "../common/recordbatch" }
common-runtime = { path = "../common/runtime" }
common-telemetry = { path = "../common/telemetry" }
common-time = { path = "../common/time" }
datafusion.workspace = true
datafusion-common.workspace = true
datafusion-expr.workspace = true

View File

@@ -14,6 +14,7 @@
use std::any::Any;
use common_datasource::file_format::Format;
use common_error::prelude::*;
use datafusion::parquet;
use datatypes::arrow::error::ArrowError;
@@ -279,6 +280,13 @@ pub enum Error {
source: query::error::Error,
},
#[snafu(display("Failed to read table: {table_name}, source: {source}"))]
ReadTable {
table_name: String,
#[snafu(backtrace)]
source: query::error::Error,
},
#[snafu(display("Failed to execute logical plan, source: {}", source))]
ExecLogicalPlan {
#[snafu(backtrace)]
@@ -363,13 +371,22 @@ pub enum Error {
},
// TODO(ruihang): merge all query execution error kinds
#[snafu(display("failed to execute PromQL query {}, source: {}", query, source))]
#[snafu(display("Failed to execute PromQL query {}, source: {}", query, source))]
ExecutePromql {
query: String,
#[snafu(backtrace)]
source: servers::error::Error,
},
#[snafu(display(
"Failed to create logical plan for prometheus query, source: {}",
source
))]
PrometheusRemoteQueryPlan {
#[snafu(backtrace)]
source: servers::error::Error,
},
#[snafu(display("Failed to describe schema for given statement, source: {}", source))]
DescribeStatement {
#[snafu(backtrace)]
@@ -427,6 +444,9 @@ pub enum Error {
source: common_datasource::error::Error,
},
#[snafu(display("Unsupported format: {:?}", format))]
UnsupportedFormat { location: Location, format: Format },
#[snafu(display("Failed to parse file format, source: {}", source))]
ParseFileFormat {
#[snafu(backtrace)]
@@ -484,6 +504,12 @@ pub enum Error {
location: Location,
},
#[snafu(display("Failed to read orc schema, source: {}", source))]
ReadOrc {
source: common_datasource::error::Error,
location: Location,
},
#[snafu(display("Failed to build parquet record batch stream, source: {}", source))]
BuildParquetRecordBatchStream {
location: Location,
@@ -532,6 +558,13 @@ pub enum Error {
#[snafu(backtrace)]
source: query::error::Error,
},
#[snafu(display("Invalid COPY parameter, key: {}, value: {}", key, value))]
InvalidCopyParameter {
key: String,
value: String,
location: Location,
},
}
pub type Result<T> = std::result::Result<T, Error>;
@@ -552,14 +585,16 @@ impl ErrorExt for Error {
| Error::InvalidSchema { .. }
| Error::PrepareImmutableTable { .. }
| Error::BuildCsvConfig { .. }
| Error::ProjectSchema { .. } => StatusCode::InvalidArguments,
| Error::ProjectSchema { .. }
| Error::UnsupportedFormat { .. } => StatusCode::InvalidArguments,
Error::NotSupported { .. } => StatusCode::Unsupported,
Error::HandleHeartbeatResponse { source, .. } => source.status_code(),
Error::RuntimeResource { source, .. } => source.status_code(),
Error::ExecutePromql { source, .. } => source.status_code(),
Error::PrometheusRemoteQueryPlan { source, .. }
| Error::ExecutePromql { source, .. } => source.status_code(),
Error::SqlExecIntercepted { source, .. } => source.status_code(),
Error::StartServer { source, .. } => source.status_code(),
@@ -621,6 +656,7 @@ impl ErrorExt for Error {
Error::ExecuteStatement { source, .. }
| Error::PlanStatement { source }
| Error::ParseQuery { source }
| Error::ReadTable { source, .. }
| Error::ExecLogicalPlan { source }
| Error::DescribeStatement { source } => source.status_code(),
@@ -642,13 +678,16 @@ impl ErrorExt for Error {
Error::TableScanExec { source, .. } => source.status_code(),
Error::ReadObject { .. } | Error::ReadParquet { .. } => StatusCode::StorageUnavailable,
Error::ReadObject { .. } | Error::ReadParquet { .. } | Error::ReadOrc { .. } => {
StatusCode::StorageUnavailable
}
Error::ListObjects { source }
| Error::ParseUrl { source }
| Error::BuildBackend { source } => source.status_code(),
Error::WriteParquet { source, .. } => source.status_code(),
Error::InvalidCopyParameter { .. } => StatusCode::InvalidArguments,
}
}

View File

@@ -20,8 +20,7 @@ use common_meta::heartbeat::handler::{
};
use common_meta::heartbeat::mailbox::{HeartbeatMailbox, MailboxRef, OutgoingMessage};
use common_meta::heartbeat::utils::outgoing_message_to_mailbox_message;
use common_telemetry::tracing::trace;
use common_telemetry::{error, info};
use common_telemetry::{debug, error, info};
use meta_client::client::{HeartbeatSender, HeartbeatStream, MetaClient};
use snafu::ResultExt;
use tokio::sync::mpsc;
@@ -83,16 +82,15 @@ impl HeartbeatTask {
loop {
match resp_stream.message().await {
Ok(Some(resp)) => {
trace!("Received a heartbeat response: {:?}", resp);
debug!("Receiving heartbeat response: {:?}", resp);
let ctx = HeartbeatResponseHandlerContext::new(mailbox.clone(), resp);
if let Err(e) = capture_self.handle_response(ctx) {
if let Err(e) = capture_self.handle_response(ctx).await {
error!(e; "Error while handling heartbeat response");
}
}
Ok(None) => break,
Err(e) => {
error!(e; "Occur error while reading heartbeat response");
capture_self
.start_with_retry(Duration::from_secs(retry_interval))
.await;
@@ -148,16 +146,17 @@ impl HeartbeatTask {
error!(e; "Failed to send heartbeat to metasrv");
break;
} else {
trace!("Send a heartbeat request to metasrv, content: {:?}", req);
debug!("Send a heartbeat request to metasrv, content: {:?}", req);
}
}
}
});
}
fn handle_response(&self, ctx: HeartbeatResponseHandlerContext) -> Result<()> {
async fn handle_response(&self, ctx: HeartbeatResponseHandlerContext) -> Result<()> {
self.resp_handler_executor
.handle(ctx)
.await
.context(error::HandleHeartbeatResponseSnafu)
}

View File

@@ -12,13 +12,15 @@
// See the License for the specific language governing permissions and
// limitations under the License.
use async_trait::async_trait;
use catalog::helper::TableGlobalKey;
use catalog::remote::KvCacheInvalidatorRef;
use common_meta::error::Result as MetaResult;
use common_meta::heartbeat::handler::{
HandleControl, HeartbeatResponseHandler, HeartbeatResponseHandlerContext,
};
use common_meta::instruction::{Instruction, InstructionReply, SimpleReply, TableIdent};
use common_meta::ident::TableIdent;
use common_meta::instruction::{Instruction, InstructionReply, SimpleReply};
use common_meta::table_name::TableName;
use common_telemetry::{error, info};
use partition::manager::TableRouteCacheInvalidatorRef;
@@ -29,6 +31,7 @@ pub struct InvalidateTableCacheHandler {
table_route_cache_invalidator: TableRouteCacheInvalidatorRef,
}
#[async_trait]
impl HeartbeatResponseHandler for InvalidateTableCacheHandler {
fn is_acceptable(&self, ctx: &HeartbeatResponseHandlerContext) -> bool {
matches!(
@@ -37,7 +40,7 @@ impl HeartbeatResponseHandler for InvalidateTableCacheHandler {
)
}
fn handle(&self, ctx: &mut HeartbeatResponseHandlerContext) -> MetaResult<HandleControl> {
async fn handle(&self, ctx: &mut HeartbeatResponseHandlerContext) -> MetaResult<HandleControl> {
// TODO(weny): considers introducing a macro
let Some((meta, Instruction::InvalidateTableCache(table_ident))) = ctx.incoming_message.take() else {
unreachable!("InvalidateTableCacheHandler: should be guarded by 'is_acceptable'");

View File

@@ -23,7 +23,8 @@ use common_meta::heartbeat::handler::{
HandlerGroupExecutor, HeartbeatResponseHandlerContext, HeartbeatResponseHandlerExecutor,
};
use common_meta::heartbeat::mailbox::{HeartbeatMailbox, MessageMeta};
use common_meta::instruction::{Instruction, InstructionReply, SimpleReply, TableIdent};
use common_meta::ident::TableIdent;
use common_meta::instruction::{Instruction, InstructionReply, SimpleReply};
use common_meta::table_name::TableName;
use partition::manager::TableRouteCacheInvalidator;
use tokio::sync::mpsc;
@@ -89,7 +90,8 @@ async fn test_invalidate_table_cache_handler() {
table_id: 0,
engine: "mito".to_string(),
}),
);
)
.await;
let (_, reply) = rx.recv().await.unwrap();
assert_matches!(
@@ -125,7 +127,8 @@ async fn test_invalidate_table_cache_handler() {
table_id: 0,
engine: "mito".to_string(),
}),
);
)
.await;
let (_, reply) = rx.recv().await.unwrap();
assert_matches!(
@@ -143,7 +146,7 @@ pub fn test_message_meta(id: u64, subject: &str, to: &str, from: &str) -> Messag
}
}
fn handle_instruction(
async fn handle_instruction(
executor: Arc<dyn HeartbeatResponseHandlerExecutor>,
mailbox: Arc<HeartbeatMailbox>,
instruction: Instruction,
@@ -152,5 +155,5 @@ fn handle_instruction(
let mut ctx: HeartbeatResponseHandlerContext =
HeartbeatResponseHandlerContext::new(mailbox, response);
ctx.incoming_message = Some((test_message_meta(1, "hi", "foo", "bar"), instruction));
executor.handle(ctx).unwrap();
executor.handle(ctx).await.unwrap();
}

View File

@@ -53,7 +53,9 @@ use meta_client::MetaClientOptions;
use partition::manager::PartitionRuleManager;
use partition::route::TableRoutes;
use query::parser::{PromQuery, QueryLanguageParser, QueryStatement};
use query::plan::LogicalPlan;
use query::query_engine::options::{validate_catalog_and_schema, QueryOptions};
use query::query_engine::DescribeResult;
use query::{QueryEngineFactory, QueryEngineRef};
use servers::error as server_error;
use servers::error::{ExecuteQuerySnafu, ParsePromQLSnafu};
@@ -73,8 +75,9 @@ use sql::statements::statement::Statement;
use crate::catalog::FrontendCatalogManager;
use crate::error::{
self, Error, ExecutePromqlSnafu, ExternalSnafu, InvalidInsertRequestSnafu,
MissingMetasrvOptsSnafu, ParseSqlSnafu, PlanStatementSnafu, Result, SqlExecInterceptedSnafu,
self, Error, ExecLogicalPlanSnafu, ExecutePromqlSnafu, ExternalSnafu,
InvalidInsertRequestSnafu, MissingMetasrvOptsSnafu, ParseSqlSnafu, PlanStatementSnafu, Result,
SqlExecInterceptedSnafu,
};
use crate::expr_factory::{CreateExprFactoryRef, DefaultCreateExprFactory};
use crate::frontend::FrontendOptions;
@@ -506,6 +509,14 @@ impl SqlQueryHandler for Instance {
}
}
async fn do_exec_plan(&self, plan: LogicalPlan, query_ctx: QueryContextRef) -> Result<Output> {
let _timer = timer!(metrics::METRIC_EXEC_PLAN_ELAPSED);
self.query_engine
.execute(plan, query_ctx)
.await
.context(ExecLogicalPlanSnafu)
}
async fn do_promql_query(
&self,
query: &PromQuery,
@@ -523,8 +534,11 @@ impl SqlQueryHandler for Instance {
&self,
stmt: Statement,
query_ctx: QueryContextRef,
) -> Result<Option<Schema>> {
if let Statement::Query(_) = stmt {
) -> Result<Option<DescribeResult>> {
if matches!(
stmt,
Statement::Insert(_) | Statement::Query(_) | Statement::Delete(_)
) {
let plan = self
.query_engine
.planner()
@@ -613,12 +627,15 @@ pub fn check_permission(
Statement::DescribeTable(stmt) => {
validate_param(stmt.name(), query_ctx)?;
}
Statement::Copy(stmd) => match stmd {
Statement::Copy(sql::statements::copy::Copy::CopyTable(stmt)) => match stmt {
CopyTable::To(copy_table_to) => validate_param(&copy_table_to.table_name, query_ctx)?,
CopyTable::From(copy_table_from) => {
validate_param(&copy_table_from.table_name, query_ctx)?
}
},
Statement::Copy(sql::statements::copy::Copy::CopyDatabase(stmt)) => {
validate_param(&stmt.database_name, query_ctx)?
}
}
Ok(())
}

View File

@@ -543,8 +543,11 @@ impl DistInstance {
table_name: format_full_table_name(catalog_name, schema_name, table_name),
})?;
let request = common_grpc_expr::alter_expr_to_request(expr.clone())
.context(AlterExprToRequestSnafu)?;
let request = common_grpc_expr::alter_expr_to_request(
table.table_info().ident.table_id,
expr.clone(),
)
.context(AlterExprToRequestSnafu)?;
let mut context = AlterContext::with_capacity(1);

View File

@@ -14,9 +14,8 @@
use api::prometheus::remote::read_request::ResponseType;
use api::prometheus::remote::{Query, QueryResult, ReadRequest, ReadResponse, WriteRequest};
use api::v1::greptime_request::Request;
use api::v1::{query_request, QueryRequest};
use async_trait::async_trait;
use common_catalog::format_full_table_name;
use common_error::prelude::BoxedError;
use common_query::Output;
use common_recordbatch::RecordBatches;
@@ -25,11 +24,14 @@ use metrics::counter;
use prost::Message;
use servers::error::{self, Result as ServerResult};
use servers::prometheus::{self, Metrics};
use servers::query_handler::grpc::GrpcQueryHandler;
use servers::query_handler::{PrometheusProtocolHandler, PrometheusResponse};
use session::context::QueryContextRef;
use snafu::{OptionExt, ResultExt};
use crate::error::{
CatalogSnafu, ExecLogicalPlanSnafu, PrometheusRemoteQueryPlanSnafu, ReadTableSnafu, Result,
TableNotFoundSnafu,
};
use crate::instance::Instance;
use crate::metrics::PROMETHEUS_REMOTE_WRITE_SAMPLES;
@@ -75,6 +77,45 @@ async fn to_query_result(table_name: &str, output: Output) -> ServerResult<Query
}
impl Instance {
async fn handle_remote_query(
&self,
ctx: &QueryContextRef,
catalog_name: &str,
schema_name: &str,
table_name: &str,
query: &Query,
) -> Result<Output> {
let table = self
.catalog_manager
.table(catalog_name, schema_name, table_name)
.await
.context(CatalogSnafu)?
.with_context(|| TableNotFoundSnafu {
table_name: format_full_table_name(catalog_name, schema_name, table_name),
})?;
let dataframe = self
.query_engine
.read_table(table)
.with_context(|_| ReadTableSnafu {
table_name: format_full_table_name(catalog_name, schema_name, table_name),
})?;
let logical_plan =
prometheus::query_to_plan(dataframe, query).context(PrometheusRemoteQueryPlanSnafu)?;
logging::debug!(
"Prometheus remote read, table: {}, logical plan: {}",
table_name,
logical_plan.display_indent(),
);
self.query_engine
.execute(logical_plan, ctx.clone())
.await
.context(ExecLogicalPlanSnafu)
}
async fn handle_remote_queries(
&self,
ctx: QueryContextRef,
@@ -82,22 +123,19 @@ impl Instance {
) -> ServerResult<Vec<(String, Output)>> {
let mut results = Vec::with_capacity(queries.len());
for query in queries {
let (table_name, sql) = prometheus::query_to_sql(query)?;
logging::debug!(
"prometheus remote read, table: {}, sql: {}",
table_name,
sql
);
let catalog_name = ctx.current_catalog();
let schema_name = ctx.current_schema();
for query in queries {
let table_name = prometheus::table_name(query)?;
let query = Request::Query(QueryRequest {
query: Some(query_request::Query::Sql(sql.to_string())),
});
let output = self
.do_query(query, ctx.clone())
.handle_remote_query(&ctx, &catalog_name, &schema_name, &table_name, query)
.await
.map_err(BoxedError::new)
.context(error::ExecuteGrpcQuerySnafu)?;
.with_context(|_| error::ExecuteQuerySnafu {
query: format!("{query:#?}"),
})?;
results.push((table_name, output));
}

View File

@@ -13,6 +13,7 @@
// limitations under the License.
pub(crate) const METRIC_HANDLE_SQL_ELAPSED: &str = "frontend.handle_sql_elapsed";
pub(crate) const METRIC_EXEC_PLAN_ELAPSED: &str = "frontend.exec_plan_elapsed";
pub(crate) const METRIC_HANDLE_SCRIPTS_ELAPSED: &str = "frontend.handle_scripts_elapsed";
pub(crate) const METRIC_RUN_SCRIPT_ELAPSED: &str = "frontend.run_script_elapsed";

View File

@@ -12,32 +12,40 @@
// See the License for the specific language governing permissions and
// limitations under the License.
mod backup;
mod copy_table_from;
mod copy_table_to;
mod describe;
mod show;
mod tql;
use std::collections::HashMap;
use std::str::FromStr;
use catalog::CatalogManagerRef;
use common_error::prelude::BoxedError;
use common_query::Output;
use common_recordbatch::RecordBatches;
use datanode::instance::sql::table_idents_to_full_name;
use common_time::range::TimestampRange;
use common_time::Timestamp;
use datanode::instance::sql::{idents_to_full_database_name, table_idents_to_full_name};
use query::parser::QueryStatement;
use query::query_engine::SqlStatementExecutorRef;
use query::QueryEngineRef;
use session::context::QueryContextRef;
use snafu::{ensure, OptionExt, ResultExt};
use sql::statements::copy::{CopyTable, CopyTableArgument};
use sql::statements::copy::{CopyDatabaseArgument, CopyTable, CopyTableArgument};
use sql::statements::statement::Statement;
use table::engine::TableReference;
use table::requests::{CopyDirection, CopyTableRequest};
use table::requests::{CopyDatabaseRequest, CopyDirection, CopyTableRequest};
use table::TableRef;
use crate::error;
use crate::error::{
CatalogSnafu, ExecLogicalPlanSnafu, ExecuteStatementSnafu, ExternalSnafu, PlanStatementSnafu,
Result, SchemaNotFoundSnafu, TableNotFoundSnafu,
};
use crate::statement::backup::{COPY_DATABASE_TIME_END_KEY, COPY_DATABASE_TIME_START_KEY};
#[derive(Clone)]
pub struct StatementExecutor {
@@ -92,14 +100,23 @@ impl StatementExecutor {
Statement::ShowTables(stmt) => self.show_tables(stmt, query_ctx).await,
Statement::Copy(stmt) => {
Statement::Copy(sql::statements::copy::Copy::CopyTable(stmt)) => {
let req = to_copy_table_request(stmt, query_ctx)?;
match req.direction {
CopyDirection::Export => self.copy_table_to(req).await,
CopyDirection::Import => self.copy_table_from(req).await,
CopyDirection::Export => {
self.copy_table_to(req).await.map(Output::AffectedRows)
}
CopyDirection::Import => {
self.copy_table_from(req).await.map(Output::AffectedRows)
}
}
}
Statement::Copy(sql::statements::copy::Copy::CopyDatabase(arg)) => {
self.copy_database(to_copy_database_request(arg, &query_ctx)?)
.await
}
Statement::CreateDatabase(_)
| Statement::CreateTable(_)
| Statement::CreateExternalTable(_)
@@ -191,5 +208,47 @@ fn to_copy_table_request(stmt: CopyTable, query_ctx: QueryContextRef) -> Result<
connection,
pattern,
direction,
// we copy the whole table by default.
timestamp_range: None,
})
}
/// Converts [CopyDatabaseArgument] to [CopyDatabaseRequest].
/// This function extracts the necessary info including catalog/database name, time range, etc.
fn to_copy_database_request(
arg: CopyDatabaseArgument,
query_ctx: &QueryContextRef,
) -> Result<CopyDatabaseRequest> {
let (catalog_name, database_name) = idents_to_full_database_name(&arg.database_name, query_ctx)
.map_err(BoxedError::new)
.context(ExternalSnafu)?;
let start_timestamp = extract_timestamp(&arg.with, COPY_DATABASE_TIME_START_KEY)?;
let end_timestamp = extract_timestamp(&arg.with, COPY_DATABASE_TIME_END_KEY)?;
let time_range = match (start_timestamp, end_timestamp) {
(Some(start), Some(end)) => TimestampRange::new(start, end),
(Some(start), None) => Some(TimestampRange::from_start(start)),
(None, Some(end)) => Some(TimestampRange::until_end(end, false)), // exclusive end
(None, None) => None,
};
Ok(CopyDatabaseRequest {
catalog_name,
schema_name: database_name,
location: arg.location,
with: arg.with,
connection: arg.connection,
time_range,
})
}
/// Extracts timestamp from a [HashMap<String, String>] with given key.
fn extract_timestamp(map: &HashMap<String, String>, key: &str) -> Result<Option<Timestamp>> {
map.get(key)
.map(|v| {
Timestamp::from_str(v)
.map_err(|_| error::InvalidCopyParameterSnafu { key, value: v }.build())
})
.transpose()
}
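
The option handling in `to_copy_database_request` and `extract_timestamp` can be sketched in isolation. The following is a minimal sketch, not the project's implementation: `Range`, `extract_ts` and `build_range` are hypothetical names, and plain `i64` values stand in for `Timestamp`. It shows how `Option::transpose` turns a "missing key" into `Ok(None)` while still surfacing parse failures as errors.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a time range; `None` means unbounded on that side.
#[derive(Debug, PartialEq)]
struct Range {
    start: Option<i64>,
    end: Option<i64>,
}

/// Looks up `key` and parses it as an integer timestamp (an assumption for this sketch).
/// A missing key yields `Ok(None)`; an unparsable value yields `Err`.
fn extract_ts(map: &HashMap<String, String>, key: &str) -> Result<Option<i64>, String> {
    map.get(key)
        .map(|v| v.parse::<i64>().map_err(|_| format!("invalid {key}: {v}")))
        .transpose()
}

/// Combines the optional bounds; returns `Ok(None)` when neither bound is given.
fn build_range(map: &HashMap<String, String>) -> Result<Option<Range>, String> {
    let start = extract_ts(map, "start_time")?;
    let end = extract_ts(map, "end_time")?;
    Ok(match (start, end) {
        (None, None) => None,
        (start, end) => Some(Range { start, end }),
    })
}
```

With an empty map `build_range` returns `Ok(None)`, which corresponds to copying the whole database; a malformed value is reported to the caller rather than silently ignored.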

View File

@@ -0,0 +1,97 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
use common_datasource::file_format::Format;
use common_query::Output;
use common_telemetry::info;
use snafu::{ensure, OptionExt, ResultExt};
use table::requests::{CopyDatabaseRequest, CopyDirection, CopyTableRequest};
use crate::error;
use crate::error::{
CatalogNotFoundSnafu, CatalogSnafu, InvalidCopyParameterSnafu, SchemaNotFoundSnafu,
};
use crate::statement::StatementExecutor;
pub(crate) const COPY_DATABASE_TIME_START_KEY: &str = "start_time";
pub(crate) const COPY_DATABASE_TIME_END_KEY: &str = "end_time";
impl StatementExecutor {
pub(crate) async fn copy_database(&self, req: CopyDatabaseRequest) -> error::Result<Output> {
// location must end with / so that every table is exported to a file.
ensure!(
req.location.ends_with('/'),
InvalidCopyParameterSnafu {
key: "location",
value: req.location,
}
);
info!(
"Copy database {}.{}, dir: {}, time: {:?}",
req.catalog_name, req.schema_name, req.location, req.time_range
);
let schema = self
.catalog_manager
.catalog(&req.catalog_name)
.await
.context(CatalogSnafu)?
.context(CatalogNotFoundSnafu {
catalog_name: &req.catalog_name,
})?
.schema(&req.schema_name)
.await
.context(CatalogSnafu)?
.context(SchemaNotFoundSnafu {
schema_info: &req.schema_name,
})?;
let suffix = Format::try_from(&req.with)
.context(error::ParseFileFormatSnafu)?
.suffix();
let table_names = schema.table_names().await.context(CatalogSnafu)?;
let mut exported_rows = 0;
for table_name in table_names {
// TODO(hl): remove this hardcode once we've removed numbers table.
if table_name == "numbers" {
continue;
}
let mut table_file = req.location.clone();
table_file.push_str(&table_name);
table_file.push_str(suffix);
info!(
"Copy table: {}.{}.{} to {}",
req.catalog_name, req.schema_name, table_name, table_file
);
let exported = self
.copy_table_to(CopyTableRequest {
catalog_name: req.catalog_name.clone(),
schema_name: req.schema_name.clone(),
table_name,
location: table_file,
with: req.with.clone(),
connection: req.connection.clone(),
pattern: None,
direction: CopyDirection::Export,
timestamp_range: req.time_range,
})
.await?;
exported_rows += exported;
}
Ok(Output::AffectedRows(exported_rows))
}
}
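
The per-table file naming used by `copy_database` can be illustrated with a small sketch. `table_file` is a hypothetical helper, and whether the format suffix carries a leading dot is an assumption made here for illustration only.

```rust
/// Builds the export file path for one table under `location`.
/// Returns an error when `location` does not end with '/', mirroring the
/// `ensure!` check in `copy_database`.
fn table_file(location: &str, table_name: &str, suffix: &str) -> Result<String, String> {
    if !location.ends_with('/') {
        return Err(format!("invalid location: {location}"));
    }
    // Assumption: `suffix` already contains the separator (for example ".parquet").
    Ok(format!("{location}{table_name}{suffix}"))
}
```

Requiring the trailing slash keeps the concatenation unambiguous: every table lands in its own file inside the target directory instead of producing a path such as `/tmp/backupcpu.parquet`.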

View File

@@ -20,11 +20,13 @@ use async_compat::CompatExt;
use common_base::readable_size::ReadableSize;
use common_datasource::file_format::csv::{CsvConfigBuilder, CsvOpener};
use common_datasource::file_format::json::JsonOpener;
use common_datasource::file_format::orc::{
infer_orc_schema, new_orc_stream_reader, OrcArrowStreamReaderAdapter,
};
use common_datasource::file_format::{FileFormat, Format};
use common_datasource::lister::{Lister, Source};
use common_datasource::object_store::{build_backend, parse_url};
use common_datasource::util::find_dir_and_filename;
use common_query::Output;
use common_recordbatch::adapter::ParquetRecordBatchStreamAdapter;
use common_recordbatch::DfSendableRecordBatchStream;
use datafusion::datasource::listing::PartitionedFile;
@@ -111,6 +113,18 @@ impl StatementExecutor {
.context(error::ReadParquetSnafu)?;
Ok(builder.schema().clone())
}
Format::Orc(_) => {
let reader = object_store
.reader(path)
.await
.context(error::ReadObjectSnafu { path })?;
let schema = infer_orc_schema(reader)
.await
.context(error::ReadOrcSnafu)?;
Ok(Arc::new(schema))
}
}
}
@@ -202,10 +216,22 @@ impl StatementExecutor {
Ok(Box::pin(ParquetRecordBatchStreamAdapter::new(upstream)))
}
Format::Orc(_) => {
let reader = object_store
.reader(path)
.await
.context(error::ReadObjectSnafu { path })?;
let stream = new_orc_stream_reader(reader)
.await
.context(error::ReadOrcSnafu)?;
let stream = OrcArrowStreamReaderAdapter::new(stream);
Ok(Box::pin(stream))
}
}
}
pub async fn copy_table_from(&self, req: CopyTableRequest) -> Result<Output> {
pub async fn copy_table_from(&self, req: CopyTableRequest) -> Result<usize> {
let table_ref = TableReference {
catalog: &req.catalog_name,
schema: &req.schema_name,
@@ -313,7 +339,7 @@ impl StatementExecutor {
}
}
Ok(Output::AffectedRows(rows_inserted))
Ok(rows_inserted)
}
}

View File

@@ -18,7 +18,6 @@ use common_datasource::file_format::json::stream_to_json;
use common_datasource::file_format::Format;
use common_datasource::object_store::{build_backend, parse_url};
use common_query::physical_plan::SessionContext;
use common_query::Output;
use common_recordbatch::adapter::DfRecordBatchStreamAdapter;
use common_recordbatch::SendableRecordBatchStream;
use object_store::ObjectStore;
@@ -69,10 +68,11 @@ impl StatementExecutor {
Ok(rows_copied)
}
_ => error::UnsupportedFormatSnafu { format: *format }.fail(),
}
}
pub(crate) async fn copy_table_to(&self, req: CopyTableRequest) -> Result<Output> {
pub(crate) async fn copy_table_to(&self, req: CopyTableRequest) -> Result<usize> {
let table_ref = TableReference {
catalog: &req.catalog_name,
schema: &req.schema_name,
@@ -82,12 +82,25 @@ impl StatementExecutor {
let format = Format::try_from(&req.with).context(error::ParseFileFormatSnafu)?;
let stream = table
.scan(None, &[], None)
.await
.with_context(|_| error::CopyTableSnafu {
table_name: table_ref.to_string(),
})?;
let filters = table
.schema()
.timestamp_column()
.and_then(|c| {
common_query::logical_plan::build_filter_from_timestamp(
&c.name,
req.timestamp_range.as_ref(),
)
})
.into_iter()
.collect::<Vec<_>>();
let stream =
table
.scan(None, &filters, None)
.await
.with_context(|_| error::CopyTableSnafu {
table_name: table_ref.to_string(),
})?;
let stream = stream
.execute(0, SessionContext::default().task_ctx())
@@ -101,6 +114,6 @@ impl StatementExecutor {
.stream_to_file(stream, &format, object_store, &path)
.await?;
Ok(Output::AffectedRows(rows_copied))
Ok(rows_copied)
}
}
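
The `and_then(..).into_iter().collect::<Vec<_>>()` chain in `copy_table_to` turns "maybe one filter" into a list of zero or one filters. A minimal sketch of the idiom, assuming a hypothetical `TimeFilter` type in place of the real query expression and `(i64, i64)` in place of `TimestampRange`:

```rust
/// Hypothetical filter type; the real code builds a query expression instead.
#[derive(Debug, PartialEq)]
struct TimeFilter {
    column: String,
    start: i64,
    end: i64,
}

/// Returns a filter only when a range is given.
fn build_filter(column: &str, range: Option<&(i64, i64)>) -> Option<TimeFilter> {
    range.map(|(start, end)| TimeFilter {
        column: column.to_string(),
        start: *start,
        end: *end,
    })
}

/// Produces an empty list when the table has no timestamp column or no range is requested.
fn filters_for(ts_column: Option<&str>, range: Option<(i64, i64)>) -> Vec<TimeFilter> {
    ts_column
        .and_then(|c| build_filter(c, range.as_ref()))
        .into_iter()
        .collect()
}
```

An empty vector keeps the scan unfiltered, so a plain `COPY TABLE` (where `timestamp_range` is `None`) behaves as before.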

View File

@@ -382,6 +382,7 @@ impl DistTable {
schema_name,
table_name,
alter_kind,
table_id: _table_id,
} = request;
let alter_expr = context

View File

@@ -12,7 +12,7 @@ common-error = { path = "../common/error" }
common-grpc = { path = "../common/grpc" }
common-telemetry = { path = "../common/telemetry" }
common-meta = { path = "../common/meta" }
etcd-client = "0.10"
etcd-client = "0.11"
rand.workspace = true
serde.workspace = true
serde_json.workspace = true

View File

@@ -755,16 +755,21 @@ mod tests {
async fn test_batch_put() {
let tc = new_client("test_batch_put").await;
let req = BatchPutRequest::new()
.add_kv(tc.key("key"), b"value".to_vec())
.add_kv(tc.key("key2"), b"value2".to_vec());
let mut req = BatchPutRequest::new();
for i in 0..275 {
req = req.add_kv(
tc.key(&format!("key-{}", i)),
format!("value-{}", i).into_bytes(),
);
}
let res = tc.client.batch_put(req).await;
assert_eq!(0, res.unwrap().take_prev_kvs().len());
let req = RangeRequest::new().with_range(tc.key("key"), tc.key("key3"));
let req = RangeRequest::new().with_prefix(tc.key("key-"));
let res = tc.client.range(req).await;
let kvs = res.unwrap().take_kvs();
assert_eq!(2, kvs.len());
assert_eq!(275, kvs.len());
}
#[tokio::test]
@@ -772,16 +777,17 @@ mod tests {
let tc = new_client("test_batch_get").await;
tc.gen_data().await;
let req = BatchGetRequest::default()
.add_key(tc.key("key-1"))
.add_key(tc.key("key-2"));
let mut req = BatchGetRequest::default();
for i in 0..256 {
req = req.add_key(tc.key(&format!("key-{}", i)));
}
let mut res = tc.client.batch_get(req).await.unwrap();
assert_eq!(2, res.take_kvs().len());
assert_eq!(10, res.take_kvs().len());
let req = BatchGetRequest::default()
.add_key(tc.key("key-1"))
.add_key(tc.key("key-222"));
.add_key(tc.key("key-999"));
let mut res = tc.client.batch_get(req).await.unwrap();
assert_eq!(1, res.take_kvs().len());

View File

@@ -24,7 +24,7 @@ common-telemetry = { path = "../common/telemetry" }
common-time = { path = "../common/time" }
dashmap = "5.4"
derive_builder = "0.12"
etcd-client = "0.10"
etcd-client = "0.11"
futures.workspace = true
h2 = "0.3"
http-body = "0.4"
@@ -38,6 +38,7 @@ regex = "1.6"
serde = "1.0"
serde_json = "1.0"
snafu.workspace = true
store-api = { path = "../store-api" }
table = { path = "../table" }
tokio.workspace = true
tokio-stream = { version = "0.1", features = ["net"] }

View File

@@ -354,6 +354,12 @@ pub enum Error {
source: common_meta::error::Error,
},
#[snafu(display("Failed to convert proto data, source: {}", source))]
ConvertProtoData {
location: Location,
source: common_meta::error::Error,
},
// this error is used for custom error mapping
// please do not delete it
#[snafu(display("Other error, source: {}", source))]
@@ -442,7 +448,9 @@ impl ErrorExt for Error {
Error::RegionFailoverCandidatesNotFound { .. } => StatusCode::RuntimeResourcesExhausted,
Error::RegisterProcedureLoader { source, .. } => source.status_code(),
Error::TableRouteConversion { source, .. } => source.status_code(),
Error::TableRouteConversion { source, .. } | Error::ConvertProtoData { source, .. } => {
source.status_code()
}
Error::Other { source, .. } => source.status_code(),
}
}

View File

@@ -19,18 +19,18 @@ use std::time::Duration;
use api::v1::meta::mailbox_message::Payload;
use api::v1::meta::{
HeartbeatRequest, HeartbeatResponse, MailboxMessage, RequestHeader, ResponseHeader, Role,
PROTOCOL_VERSION,
HeartbeatRequest, HeartbeatResponse, MailboxMessage, RegionLease, RequestHeader,
ResponseHeader, Role, PROTOCOL_VERSION,
};
pub use check_leader_handler::CheckLeaderHandler;
pub use collect_stats_handler::CollectStatsHandler;
use common_meta::instruction::{Instruction, InstructionReply};
use common_telemetry::{debug, info, warn};
use common_telemetry::{debug, info, timer, warn};
use dashmap::DashMap;
pub use failure_handler::RegionFailureHandler;
pub use keep_lease_handler::KeepLeaseHandler;
use metrics::{decrement_gauge, increment_gauge};
pub use on_leader_start::OnLeaderStartHandler;
pub use on_leader_start_handler::OnLeaderStartHandler;
pub use persist_stats_handler::PersistStatsHandler;
pub use response_header_handler::ResponseHeaderHandler;
use snafu::{OptionExt, ResultExt};
@@ -40,7 +40,7 @@ use tokio::sync::{oneshot, Notify, RwLock};
use self::node_stat::Stat;
use crate::error::{self, DeserializeFromJsonSnafu, Result, UnexpectedInstructionReplySnafu};
use crate::metasrv::Context;
use crate::metrics::METRIC_META_HEARTBEAT_CONNECTION_NUM;
use crate::metrics::{METRIC_META_HANDLER_EXECUTE, METRIC_META_HEARTBEAT_CONNECTION_NUM};
use crate::sequence::Sequence;
use crate::service::mailbox::{
BroadcastChannel, Channel, Mailbox, MailboxReceiver, MailboxRef, MessageId,
@@ -52,14 +52,21 @@ pub(crate) mod failure_handler;
mod keep_lease_handler;
pub mod mailbox_handler;
pub mod node_stat;
mod on_leader_start;
mod on_leader_start_handler;
mod persist_stats_handler;
pub(crate) mod region_lease_handler;
mod response_header_handler;
#[async_trait::async_trait]
pub trait HeartbeatHandler: Send + Sync {
fn is_acceptable(&self, role: Role) -> bool;
fn name(&self) -> &'static str {
let type_name = std::any::type_name::<Self>();
// short name
type_name.split("::").last().unwrap_or(type_name)
}
async fn handle(
&self,
req: &HeartbeatRequest,
@@ -73,6 +80,7 @@ pub struct HeartbeatAccumulator {
pub header: Option<ResponseHeader>,
pub instructions: Vec<Instruction>,
pub stat: Option<Stat>,
pub region_leases: Vec<RegionLease>,
}
impl HeartbeatAccumulator {
@@ -130,6 +138,7 @@ impl Pushers {
.push(HeartbeatResponse {
header: Some(pusher.header()),
mailbox_message: Some(mailbox_message),
..Default::default()
})
.await
}
@@ -151,6 +160,7 @@ impl Pushers {
.push(HeartbeatResponse {
header: Some(pusher.header()),
mailbox_message: Some(mailbox_message),
..Default::default()
})
.await?;
}
@@ -167,9 +177,22 @@ impl Pushers {
}
}
struct NameCachedHandler {
name: &'static str,
handler: Box<dyn HeartbeatHandler>,
}
impl NameCachedHandler {
fn new(handler: impl HeartbeatHandler + 'static) -> Self {
let name = handler.name();
let handler = Box::new(handler);
Self { name, handler }
}
}
#[derive(Clone, Default)]
pub struct HeartbeatHandlerGroup {
handlers: Arc<RwLock<Vec<Box<dyn HeartbeatHandler>>>>,
handlers: Arc<RwLock<Vec<NameCachedHandler>>>,
pushers: Pushers,
}
@@ -183,7 +206,7 @@ impl HeartbeatHandlerGroup {
pub async fn add_handler(&self, handler: impl HeartbeatHandler + 'static) {
let mut handlers = self.handlers.write().await;
handlers.push(Box::new(handler));
handlers.push(NameCachedHandler::new(handler));
}
pub async fn register(&self, key: impl AsRef<str>, pusher: Pusher) {
@@ -219,19 +242,21 @@ impl HeartbeatHandlerGroup {
err_msg: format!("invalid role: {:?}", req.header),
})?;
for h in handlers.iter() {
for NameCachedHandler { name, handler } in handlers.iter() {
if ctx.is_skip_all() {
break;
}
if h.is_acceptable(role) {
h.handle(&req, &mut ctx, &mut acc).await?;
if handler.is_acceptable(role) {
let _timer = timer!(METRIC_META_HANDLER_EXECUTE, &[("name", *name)]);
handler.handle(&req, &mut ctx, &mut acc).await?;
}
}
let header = std::mem::take(&mut acc.header);
let res = HeartbeatResponse {
header,
mailbox_message: acc.into_mailbox_message(),
region_leases: acc.region_leases,
..Default::default()
};
Ok(res)
}
@@ -378,7 +403,11 @@ mod tests {
use api::v1::meta::{MailboxMessage, RequestHeader, Role, PROTOCOL_VERSION};
use tokio::sync::mpsc;
use crate::handler::{HeartbeatHandlerGroup, HeartbeatMailbox, Pusher};
use crate::handler::mailbox_handler::MailboxHandler;
use crate::handler::{
CheckLeaderHandler, CollectStatsHandler, HeartbeatHandlerGroup, HeartbeatMailbox,
OnLeaderStartHandler, PersistStatsHandler, Pusher, ResponseHeaderHandler,
};
use crate::sequence::Sequence;
use crate::service::mailbox::{Channel, MailboxReceiver, MailboxRef};
use crate::service::store::memory::MemStore;
@@ -447,4 +476,25 @@ mod tests {
(mailbox, receiver)
}
#[tokio::test]
async fn test_handler_name() {
let group = HeartbeatHandlerGroup::default();
group.add_handler(ResponseHeaderHandler::default()).await;
group.add_handler(CheckLeaderHandler::default()).await;
group.add_handler(OnLeaderStartHandler::default()).await;
group.add_handler(CollectStatsHandler::default()).await;
group.add_handler(MailboxHandler::default()).await;
group.add_handler(PersistStatsHandler::default()).await;
let handlers = group.handlers.read().await;
assert_eq!(6, handlers.len());
assert_eq!("ResponseHeaderHandler", handlers[0].handler.name());
assert_eq!("CheckLeaderHandler", handlers[1].handler.name());
assert_eq!("OnLeaderStartHandler", handlers[2].handler.name());
assert_eq!("CollectStatsHandler", handlers[3].handler.name());
assert_eq!("MailboxHandler", handlers[4].handler.name());
assert_eq!("PersistStatsHandler", handlers[5].handler.name());
}
}
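
The default `name()` method and the name cache can be tried out standalone. This is a sketch under assumptions: `Handler` and `NameCached` are simplified stand-ins for `HeartbeatHandler` and `NameCachedHandler`, with the async `handle` method left out.

```rust
trait Handler {
    /// Short type name derived from the fully qualified one.
    /// `std::any::type_name` is intended for diagnostics; its exact output is not guaranteed.
    fn name(&self) -> &'static str {
        let type_name = std::any::type_name::<Self>();
        type_name.split("::").last().unwrap_or(type_name)
    }
}

#[derive(Default)]
struct CheckLeaderHandler;

impl Handler for CheckLeaderHandler {}

/// Computes the name once at registration so it can be reused as a metric label.
struct NameCached {
    name: &'static str,
    handler: Box<dyn Handler>,
}

impl NameCached {
    fn new(handler: impl Handler + 'static) -> Self {
        let name = handler.name();
        Self {
            name,
            handler: Box::new(handler),
        }
    }
}
```

Splitting on `::` works for plain unit structs like the handlers in the test above; a handler with generic parameters would likely need a different shortening rule, because the last path segment would then belong to the type argument.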

View File

@@ -20,6 +20,7 @@ use crate::error::Result;
use crate::handler::{HeartbeatAccumulator, HeartbeatHandler};
use crate::metasrv::Context;
#[derive(Default)]
pub struct CollectStatsHandler;
#[async_trait::async_trait]

View File

@@ -19,7 +19,7 @@ use std::sync::Arc;
use api::v1::meta::{HeartbeatRequest, Role};
use async_trait::async_trait;
use common_catalog::consts::MITO_ENGINE;
use common_meta::instruction::TableIdent;
use common_meta::ident::TableIdent;
use common_meta::RegionIdent;
use table::engine::table_id;
@@ -36,6 +36,7 @@ pub(crate) struct DatanodeHeartbeat {
pub struct RegionFailureHandler {
failure_detect_runner: FailureDetectRunner,
region_failover_manager: Arc<RegionFailoverManager>,
}
impl RegionFailureHandler {
@@ -45,13 +46,19 @@ impl RegionFailureHandler {
) -> Result<Self> {
region_failover_manager.try_start()?;
let mut failure_detect_runner = FailureDetectRunner::new(election, region_failover_manager);
let mut failure_detect_runner =
FailureDetectRunner::new(election, region_failover_manager.clone());
failure_detect_runner.start().await;
Ok(Self {
failure_detect_runner,
region_failover_manager,
})
}
pub(crate) fn region_failover_manager(&self) -> &Arc<RegionFailoverManager> {
&self.region_failover_manager
}
}
#[async_trait]

View File

@@ -246,7 +246,7 @@ impl FailureDetectorContainer {
#[cfg(test)]
mod tests {
use common_catalog::consts::MITO_ENGINE;
use common_meta::instruction::TableIdent;
use common_meta::ident::TableIdent;
use rand::Rng;
use super::*;

View File

@@ -42,6 +42,8 @@ pub struct Stat {
pub write_io_rate: f64,
/// Region stats on this node
pub region_stats: Vec<RegionStat>,
/// The node epoch is used to check whether the node has restarted or redeployed.
pub node_epoch: u64,
}
#[derive(Debug, Default, Serialize, Deserialize)]
@@ -79,6 +81,7 @@ impl TryFrom<HeartbeatRequest> for Stat {
is_leader,
node_stat,
region_stats,
node_epoch,
..
} = value;
@@ -104,6 +107,7 @@ impl TryFrom<HeartbeatRequest> for Stat {
read_io_rate: node_stat.read_io_rate,
write_io_rate: node_stat.write_io_rate,
region_stats: region_stats.into_iter().map(RegionStat::from).collect(),
node_epoch,
})
}
_ => Err(()),

View File

@@ -23,9 +23,47 @@ use crate::metasrv::Context;
const MAX_CACHED_STATS_PER_KEY: usize = 10;
#[derive(Default)]
struct EpochStats {
stats: Vec<Stat>,
epoch: Option<u64>,
}
impl EpochStats {
#[inline]
fn drain_all(&mut self) -> Vec<Stat> {
self.stats.drain(..).collect()
}
#[inline]
fn clear(&mut self) {
self.stats.clear();
}
#[inline]
fn push(&mut self, stat: Stat) {
self.stats.push(stat);
}
#[inline]
fn len(&self) -> usize {
self.stats.len()
}
#[inline]
fn epoch(&self) -> Option<u64> {
self.epoch
}
#[inline]
fn set_epoch(&mut self, epoch: u64) {
self.epoch = Some(epoch);
}
}
#[derive(Default)]
pub struct PersistStatsHandler {
stats_cache: DashMap<StatKey, Vec<Stat>>,
stats_cache: DashMap<StatKey, EpochStats>,
}
#[async_trait::async_trait]
@@ -40,26 +78,47 @@ impl HeartbeatHandler for PersistStatsHandler {
ctx: &mut Context,
acc: &mut HeartbeatAccumulator,
) -> Result<()> {
let Some(stat) = acc.stat.take() else { return Ok(()) };
let Some(current_stat) = acc.stat.take() else { return Ok(()) };
let key = stat.stat_key();
let key = current_stat.stat_key();
let mut entry = self
.stats_cache
.entry(key)
.or_insert_with(|| Vec::with_capacity(MAX_CACHED_STATS_PER_KEY));
let stats = entry.value_mut();
stats.push(stat);
.or_insert_with(EpochStats::default);
if stats.len() < MAX_CACHED_STATS_PER_KEY {
let key: Vec<u8> = key.into();
let epoch_stats = entry.value_mut();
let refresh = if let Some(epoch) = epoch_stats.epoch() {
// This node may have been redeployed.
if current_stat.node_epoch > epoch {
epoch_stats.set_epoch(current_stat.node_epoch);
epoch_stats.clear();
true
} else {
false
}
} else {
epoch_stats.set_epoch(current_stat.node_epoch);
// If the epoch is empty, the current node is sending its first heartbeat
// to the current meta leader, so the data must be persisted to the
// KV store as soon as possible.
true
};
epoch_stats.push(current_stat);
if !refresh && epoch_stats.len() < MAX_CACHED_STATS_PER_KEY {
return Ok(());
}
let stats = stats.drain(..).collect();
let val = StatValue { stats };
let value: Vec<u8> = StatValue {
stats: epoch_stats.drain_all(),
}
.try_into()?;
let put = PutRequest {
key: key.into(),
value: val.try_into()?,
key,
value,
..Default::default()
};
@@ -74,12 +133,11 @@ mod tests {
use std::sync::atomic::AtomicBool;
use std::sync::Arc;
use api::v1::meta::RangeRequest;
use super::*;
use crate::handler::{HeartbeatMailbox, Pushers};
use crate::keys::StatKey;
use crate::sequence::Sequence;
use crate::service::store::ext::KvStoreExt;
use crate::service::store::memory::MemStore;
#[tokio::test]
@@ -88,7 +146,7 @@ mod tests {
let kv_store = Arc::new(MemStore::new());
let seq = Sequence::new("test_seq", 0, 10, kv_store.clone());
let mailbox = HeartbeatMailbox::create(Pushers::default(), seq);
let mut ctx = Context {
let ctx = Context {
server_addr: "127.0.0.1:0000".to_string(),
in_memory,
kv_store,
@@ -98,9 +156,40 @@ mod tests {
is_infancy: false,
};
let req = HeartbeatRequest::default();
let handler = PersistStatsHandler::default();
for i in 1..=MAX_CACHED_STATS_PER_KEY {
handle_request_many_times(ctx.clone(), &handler, 1).await;
let key = StatKey {
cluster_id: 3,
node_id: 101,
};
let res = ctx.in_memory.get(key.try_into().unwrap()).await.unwrap();
assert!(res.is_some());
let kv = res.unwrap();
let key: StatKey = kv.key.clone().try_into().unwrap();
assert_eq!(3, key.cluster_id);
assert_eq!(101, key.node_id);
let val: StatValue = kv.value.try_into().unwrap();
// first new stat must be set in kv store immediately
assert_eq!(1, val.stats.len());
assert_eq!(Some(1), val.stats[0].region_num);
handle_request_many_times(ctx.clone(), &handler, 10).await;
let res = ctx.in_memory.get(key.try_into().unwrap()).await.unwrap();
assert!(res.is_some());
let kv = res.unwrap();
let val: StatValue = kv.value.try_into().unwrap();
// refresh every 10 stats
assert_eq!(10, val.stats.len());
}
async fn handle_request_many_times(
mut ctx: Context,
handler: &PersistStatsHandler,
loop_times: i32,
) {
let req = HeartbeatRequest::default();
for i in 1..=loop_times {
let mut acc = HeartbeatAccumulator {
stat: Some(Stat {
cluster_id: 3,
@@ -112,30 +201,5 @@ mod tests {
};
handler.handle(&req, &mut ctx, &mut acc).await.unwrap();
}
let key = StatKey {
cluster_id: 3,
node_id: 101,
};
let req = RangeRequest {
key: key.try_into().unwrap(),
..Default::default()
};
let res = ctx.in_memory.range(req).await.unwrap();
assert_eq!(1, res.kvs.len());
let kv = &res.kvs[0];
let key: StatKey = kv.key.clone().try_into().unwrap();
assert_eq!(3, key.cluster_id);
assert_eq!(101, key.node_id);
let val: StatValue = kv.value.clone().try_into().unwrap();
assert_eq!(10, val.stats.len());
assert_eq!(Some(1), val.stats[0].region_num);
}
}
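
The flush rule in `PersistStatsHandler::handle` can be modelled without the KV store. The sketch below is an assumption-laden simplification: `EpochCache` is a hypothetical type, `u64` values stand in for `Stat`, and the returned `Option<Vec<u64>>` represents "the batch that would be written".

```rust
const MAX_CACHED: usize = 10;

#[derive(Default)]
struct EpochCache {
    stats: Vec<u64>,
    epoch: Option<u64>,
}

impl EpochCache {
    /// Buffers `stat` and returns the batch to persist if a flush is due.
    fn push(&mut self, node_epoch: u64, stat: u64) -> Option<Vec<u64>> {
        let refresh = match self.epoch {
            // A larger epoch suggests the node restarted: drop stale entries.
            Some(epoch) if node_epoch > epoch => {
                self.epoch = Some(node_epoch);
                self.stats.clear();
                true
            }
            Some(_) => false,
            // First heartbeat seen for this key: persist right away.
            None => {
                self.epoch = Some(node_epoch);
                true
            }
        };
        self.stats.push(stat);
        if !refresh && self.stats.len() < MAX_CACHED {
            return None;
        }
        Some(self.stats.drain(..).collect())
    }
}
```

This mirrors what the test above checks: the first stat is flushed immediately as a batch of one, and afterwards a batch is produced on every tenth stat unless the epoch increases in between.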

View File

@@ -0,0 +1,226 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
use std::collections::HashMap;
use std::sync::Arc;
use api::v1::meta::{HeartbeatRequest, RegionLease, Role};
use async_trait::async_trait;
use catalog::helper::TableGlobalKey;
use common_meta::ident::TableIdent;
use common_meta::ClusterId;
use store_api::storage::RegionNumber;
use crate::error::Result;
use crate::handler::{HeartbeatAccumulator, HeartbeatHandler};
use crate::metasrv::Context;
use crate::procedure::region_failover::{RegionFailoverKey, RegionFailoverManager};
use crate::service::store::kv::KvStoreRef;
use crate::table_routes;
/// The lease seconds of a region. It's set by two default heartbeat intervals (5 seconds × 2) plus
/// two round-trip times (2 seconds × 2 × 2), plus some extra buffer (2 seconds).
// TODO(LFC): Make region lease seconds calculated from Datanode heartbeat configuration.
pub(crate) const REGION_LEASE_SECONDS: u64 = 20;
pub(crate) struct RegionLeaseHandler {
kv_store: KvStoreRef,
region_failover_manager: Option<Arc<RegionFailoverManager>>,
}
impl RegionLeaseHandler {
pub(crate) fn new(
kv_store: KvStoreRef,
region_failover_manager: Option<Arc<RegionFailoverManager>>,
) -> Self {
Self {
kv_store,
region_failover_manager,
}
}
/// Filter out the regions that are currently in failover.
/// It's meaningless to extend the lease of a region if it is in failover.
fn filter_failover_regions(
&self,
cluster_id: ClusterId,
table_ident: &TableIdent,
regions: Vec<RegionNumber>,
) -> Vec<RegionNumber> {
if let Some(region_failover_manager) = &self.region_failover_manager {
let mut region_failover_key = RegionFailoverKey {
cluster_id,
table_ident: table_ident.clone(),
region_number: 0,
};
regions
.into_iter()
.filter(|region| {
region_failover_key.region_number = *region;
!region_failover_manager.is_region_failover_running(&region_failover_key)
})
.collect()
} else {
regions
}
}
}
#[async_trait]
impl HeartbeatHandler for RegionLeaseHandler {
fn is_acceptable(&self, role: Role) -> bool {
role == Role::Datanode
}
async fn handle(
&self,
req: &HeartbeatRequest,
_: &mut Context,
acc: &mut HeartbeatAccumulator,
) -> Result<()> {
let Some(stat) = acc.stat.as_ref() else { return Ok(()) };
let mut datanode_regions = HashMap::new();
stat.region_stats.iter().for_each(|x| {
let key = TableGlobalKey {
catalog_name: x.catalog.to_string(),
schema_name: x.schema.to_string(),
table_name: x.table.to_string(),
};
datanode_regions
.entry(key)
.or_insert_with(Vec::new)
.push(table::engine::region_number(x.id));
});
// TODO(LFC): Retrieve table global values from some cache here.
let table_global_values = table_routes::batch_get_table_global_value(
&self.kv_store,
datanode_regions.keys().collect::<Vec<_>>(),
)
.await?;
let mut region_leases = Vec::with_capacity(datanode_regions.len());
for (table_global_key, local_regions) in datanode_regions {
let Some(Some(table_global_value)) = table_global_values.get(&table_global_key) else { continue };
let Some(global_regions) = table_global_value.regions_id_map.get(&stat.id) else { continue };
// Keep only the regions that the table global metadata designates to the given Datanode for the given table.
let designated_regions = local_regions
.into_iter()
.filter(|x| global_regions.contains(x))
.collect::<Vec<_>>();
let table_ident = TableIdent {
catalog: table_global_key.catalog_name.to_string(),
schema: table_global_key.schema_name.to_string(),
table: table_global_key.table_name.to_string(),
table_id: table_global_value.table_id(),
engine: table_global_value.engine().to_string(),
};
let designated_regions =
self.filter_failover_regions(stat.cluster_id, &table_ident, designated_regions);
region_leases.push(RegionLease {
table_ident: Some(table_ident.into()),
regions: designated_regions,
duration_since_epoch: req.duration_since_epoch,
lease_seconds: REGION_LEASE_SECONDS,
});
}
acc.region_leases = region_leases;
Ok(())
}
}
#[cfg(test)]
mod test {
use common_catalog::consts::{DEFAULT_CATALOG_NAME, DEFAULT_SCHEMA_NAME};
use super::*;
use crate::handler::node_stat::{RegionStat, Stat};
use crate::metasrv::builder::MetaSrvBuilder;
use crate::test_util;
#[tokio::test]
async fn test_handle_region_lease() {
let region_failover_manager = test_util::create_region_failover_manager();
let kv_store = region_failover_manager
.create_context()
.selector_ctx
.kv_store
.clone();
let table_name = "my_table";
let _ = table_routes::tests::prepare_table_global_value(&kv_store, table_name).await;
let table_ident = TableIdent {
catalog: DEFAULT_CATALOG_NAME.to_string(),
schema: DEFAULT_SCHEMA_NAME.to_string(),
table: table_name.to_string(),
table_id: 1,
engine: "mito".to_string(),
};
region_failover_manager
.running_procedures()
.write()
.unwrap()
.insert(RegionFailoverKey {
cluster_id: 1,
table_ident: table_ident.clone(),
region_number: 1,
});
let handler = RegionLeaseHandler::new(kv_store, Some(region_failover_manager));
let req = HeartbeatRequest {
duration_since_epoch: 1234,
..Default::default()
};
let builder = MetaSrvBuilder::new();
let metasrv = builder.build().await.unwrap();
let ctx = &mut metasrv.new_ctx();
let acc = &mut HeartbeatAccumulator::default();
let new_region_stat = |region_id: u64| -> RegionStat {
RegionStat {
id: region_id,
catalog: DEFAULT_CATALOG_NAME.to_string(),
schema: DEFAULT_SCHEMA_NAME.to_string(),
table: table_name.to_string(),
..Default::default()
}
};
acc.stat = Some(Stat {
cluster_id: 1,
id: 1,
region_stats: vec![new_region_stat(1), new_region_stat(2), new_region_stat(3)],
..Default::default()
});
handler.handle(&req, ctx, acc).await.unwrap();
// region 1 is in failover and region 3 is not in the table global value,
// so only region 2's lease is extended.
assert_eq!(acc.region_leases.len(), 1);
let lease = acc.region_leases.remove(0);
assert_eq!(lease.table_ident.unwrap(), table_ident.into());
assert_eq!(lease.regions, vec![2]);
assert_eq!(lease.duration_since_epoch, 1234);
assert_eq!(lease.lease_seconds, REGION_LEASE_SECONDS);
}
}
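Review note: the lease arithmetic and the two filtering steps in `RegionLeaseHandler::handle` can be sketched without any GreptimeDB types. This is a minimal illustration, not the handler itself: `regions_to_renew` and the plain `u32` region numbers are hypothetical stand-ins, and the designated set `{1, 2}` is an assumption chosen to mirror the test above.

```rust
use std::collections::HashSet;

/// Mirrors the doc comment on the constant in the diff:
/// 5 s × 2 (heartbeats) + 2 s × 2 × 2 (round trips) + 2 s (buffer) = 20 s.
const REGION_LEASE_SECONDS: u64 = 5 * 2 + 2 * 2 * 2 + 2;

/// Hypothetical helper: keep a reported region only if it is designated to
/// this Datanode and is not currently in failover.
fn regions_to_renew(
    reported: &[u32],
    designated: &HashSet<u32>,
    in_failover: &HashSet<u32>,
) -> Vec<u32> {
    reported
        .iter()
        .copied()
        .filter(|r| designated.contains(r) && !in_failover.contains(r))
        .collect()
}

fn main() {
    // Assumed setup: regions 1 and 2 are designated, region 1 is in failover.
    let designated = HashSet::from([1u32, 2]);
    let in_failover = HashSet::from([1u32]);
    let renewed = regions_to_renew(&[1, 2, 3], &designated, &in_failover);
    println!("renew {:?} for {} seconds", renewed, REGION_LEASE_SECONDS);
}
```

With these inputs only region 2 survives both filters, which matches what `test_handle_region_lease` asserts.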

View File

@@ -88,6 +88,7 @@ mod tests {
let res = HeartbeatResponse {
header,
mailbox_message: acc.into_mailbox_message(),
..Default::default()
};
assert_eq!(1, res.header.unwrap().cluster_id);
}

View File

@@ -49,6 +49,7 @@ pub struct MetaSrvOptions {
pub datanode_lease_secs: i64,
pub selector: SelectorType,
pub use_memory_store: bool,
pub disable_region_failover: bool,
pub http_opts: HttpOptions,
pub logging: LoggingOptions,
}
@@ -62,6 +63,7 @@ impl Default for MetaSrvOptions {
datanode_lease_secs: 15,
selector: SelectorType::default(),
use_memory_store: false,
disable_region_failover: false,
http_opts: HttpOptions::default(),
logging: LoggingOptions::default(),
}

View File

@@ -20,6 +20,7 @@ use common_procedure::local::{LocalManager, ManagerConfig};
use crate::cluster::MetaPeerClient;
use crate::error::Result;
use crate::handler::mailbox_handler::MailboxHandler;
use crate::handler::region_lease_handler::RegionLeaseHandler;
use crate::handler::{
CheckLeaderHandler, CollectStatsHandler, HeartbeatHandlerGroup, HeartbeatMailbox,
KeepLeaseHandler, OnLeaderStartHandler, PersistStatsHandler, Pushers, RegionFailureHandler,
@@ -146,24 +147,36 @@ impl MetaSrvBuilder {
let handler_group = match handler_group {
Some(handler_group) => handler_group,
None => {
let region_failover_manager = Arc::new(RegionFailoverManager::new(
mailbox.clone(),
procedure_manager.clone(),
selector.clone(),
SelectorContext {
server_addr: options.server_addr.clone(),
datanode_lease_secs: options.datanode_lease_secs,
kv_store: kv_store.clone(),
catalog: None,
schema: None,
table: None,
},
lock.clone(),
));
let region_failover_handler = if options.disable_region_failover {
None
} else {
let region_failover_manager = Arc::new(RegionFailoverManager::new(
mailbox.clone(),
procedure_manager.clone(),
selector.clone(),
SelectorContext {
server_addr: options.server_addr.clone(),
datanode_lease_secs: options.datanode_lease_secs,
kv_store: kv_store.clone(),
catalog: None,
schema: None,
table: None,
},
lock.clone(),
));
let region_failure_handler =
RegionFailureHandler::try_new(election.clone(), region_failover_manager)
.await?;
Some(
RegionFailureHandler::try_new(election.clone(), region_failover_manager)
.await?,
)
};
let region_lease_handler = RegionLeaseHandler::new(
kv_store.clone(),
region_failover_handler
.as_ref()
.map(|x| x.region_failover_manager().clone()),
);
let group = HeartbeatHandlerGroup::new(pushers);
let keep_lease_handler = KeepLeaseHandler::new(kv_store.clone());
@@ -174,9 +187,12 @@ impl MetaSrvBuilder {
group.add_handler(keep_lease_handler).await;
group.add_handler(CheckLeaderHandler::default()).await;
group.add_handler(OnLeaderStartHandler::default()).await;
group.add_handler(CollectStatsHandler).await;
group.add_handler(MailboxHandler).await;
group.add_handler(region_failure_handler).await;
group.add_handler(CollectStatsHandler::default()).await;
group.add_handler(MailboxHandler::default()).await;
if let Some(region_failover_handler) = region_failover_handler {
group.add_handler(region_failover_handler).await;
}
group.add_handler(region_lease_handler).await;
group.add_handler(PersistStatsHandler::default()).await;
group
}

View File

@@ -17,3 +17,4 @@ pub(crate) const METRIC_META_CREATE_SCHEMA: &str = "meta.create_schema";
pub(crate) const METRIC_META_KV_REQUEST: &str = "meta.kv_request";
pub(crate) const METRIC_META_ROUTE_REQUEST: &str = "meta.route_request";
pub(crate) const METRIC_META_HEARTBEAT_CONNECTION_NUM: &str = "meta.heartbeat_connection_num";
pub(crate) const METRIC_META_HANDLER_EXECUTE: &str = "meta.handler_execute";

View File

@@ -21,12 +21,13 @@ mod update_metadata;
use std::collections::HashSet;
use std::fmt::Debug;
use std::sync::{Arc, Mutex};
use std::sync::{Arc, RwLock};
use std::time::Duration;
use async_trait::async_trait;
use catalog::helper::TableGlobalKey;
use common_meta::RegionIdent;
use common_meta::ident::TableIdent;
use common_meta::{ClusterId, RegionIdent};
use common_procedure::error::{
Error as ProcedureError, FromJsonSnafu, Result as ProcedureResult, ToJsonSnafu,
};
@@ -38,6 +39,7 @@ use common_telemetry::{error, info, warn};
use failover_start::RegionFailoverStart;
use serde::{Deserialize, Serialize};
use snafu::ResultExt;
use store_api::storage::RegionNumber;
use crate::error::{Error, RegisterProcedureLoaderSnafu, Result};
use crate::lock::DistLockRef;
@@ -48,26 +50,41 @@ use crate::service::store::ext::KvStoreExt;
const OPEN_REGION_MESSAGE_TIMEOUT: Duration = Duration::from_secs(30);
const CLOSE_REGION_MESSAGE_TIMEOUT: Duration = Duration::from_secs(2);
/// A key for preventing multiple failover procedures from running for the same region.
#[derive(PartialEq, Eq, Hash, Clone)]
pub(crate) struct RegionFailoverKey {
pub(crate) cluster_id: ClusterId,
pub(crate) table_ident: TableIdent,
pub(crate) region_number: RegionNumber,
}
impl From<RegionIdent> for RegionFailoverKey {
fn from(region_ident: RegionIdent) -> Self {
Self {
cluster_id: region_ident.cluster_id,
table_ident: region_ident.table_ident,
region_number: region_ident.region_number,
}
}
}
pub(crate) struct RegionFailoverManager {
mailbox: MailboxRef,
procedure_manager: ProcedureManagerRef,
selector: SelectorRef,
selector_ctx: SelectorContext,
dist_lock: DistLockRef,
running_procedures: Arc<Mutex<HashSet<RegionIdent>>>,
running_procedures: Arc<RwLock<HashSet<RegionFailoverKey>>>,
}
struct FailoverProcedureGuard<'a> {
running_procedures: Arc<Mutex<HashSet<RegionIdent>>>,
failed_region: &'a RegionIdent,
struct FailoverProcedureGuard {
running_procedures: Arc<RwLock<HashSet<RegionFailoverKey>>>,
key: RegionFailoverKey,
}
impl Drop for FailoverProcedureGuard<'_> {
impl Drop for FailoverProcedureGuard {
fn drop(&mut self) {
self.running_procedures
.lock()
.unwrap()
.remove(self.failed_region);
self.running_procedures.write().unwrap().remove(&self.key);
}
}
@@ -85,11 +102,11 @@ impl RegionFailoverManager {
selector,
selector_ctx,
dist_lock,
running_procedures: Arc::new(Mutex::new(HashSet::new())),
running_procedures: Arc::new(RwLock::new(HashSet::new())),
}
}
fn create_context(&self) -> RegionFailoverContext {
pub(crate) fn create_context(&self) -> RegionFailoverContext {
RegionFailoverContext {
mailbox: self.mailbox.clone(),
selector: self.selector.clone(),
@@ -113,19 +130,36 @@ impl RegionFailoverManager {
})
}
fn insert_running_procedures(&self, failed_region: &RegionIdent) -> bool {
let mut procedures = self.running_procedures.lock().unwrap();
if procedures.contains(failed_region) {
return false;
pub(crate) fn is_region_failover_running(&self, key: &RegionFailoverKey) -> bool {
self.running_procedures.read().unwrap().contains(key)
}
fn insert_running_procedures(
&self,
failed_region: &RegionIdent,
) -> Option<FailoverProcedureGuard> {
let key = RegionFailoverKey::from(failed_region.clone());
let mut procedures = self.running_procedures.write().unwrap();
if procedures.insert(key.clone()) {
Some(FailoverProcedureGuard {
running_procedures: self.running_procedures.clone(),
key,
})
} else {
None
}
procedures.insert(failed_region.clone())
}
#[cfg(test)]
pub(crate) fn running_procedures(&self) -> Arc<RwLock<HashSet<RegionFailoverKey>>> {
self.running_procedures.clone()
}
pub(crate) async fn do_region_failover(&self, failed_region: &RegionIdent) -> Result<()> {
if !self.insert_running_procedures(failed_region) {
let Some(guard) = self.insert_running_procedures(failed_region) else {
warn!("Region failover procedure for region {failed_region} is already running!");
return Ok(());
}
};
if !self.table_exists(failed_region).await? {
// The table could be dropped before the failure detector knows it. Then the region
@@ -142,13 +176,9 @@ impl RegionFailoverManager {
info!("Starting region failover procedure {procedure_id} for region {failed_region:?}");
let procedure_manager = self.procedure_manager.clone();
let running_procedures = self.running_procedures.clone();
let failed_region = failed_region.clone();
common_runtime::spawn_bg(async move {
let _guard = FailoverProcedureGuard {
running_procedures,
failed_region: &failed_region,
};
let _ = guard;
let watcher = &mut match procedure_manager.submit(procedure_with_id).await {
Ok(watcher) => watcher,
@@ -178,7 +208,7 @@ impl RegionFailoverManager {
let table_global_value = self
.selector_ctx
.kv_store
.get(table_global_key.to_string().into_bytes())
.get(table_global_key.to_raw_key())
.await?;
Ok(table_global_value.is_some())
}
@@ -232,7 +262,8 @@ trait State: Sync + Send + Debug {
/// │ │ │
/// └─────────┘ │ Sends "Close Region" request
/// │ to the failed Datanode, and
/// ┌─────────┐ wait for 2 seconds
/// │ wait for the Region lease expiry
/// ┌─────────┐ │ seconds
/// │ │ │
/// │ ┌──▼────▼──────┐
/// Wait candidate │ │ActivateRegion◄───────────────────────┐
@@ -260,7 +291,6 @@ trait State: Sync + Send + Debug {
/// │ Broadcast Invalidate Table
/// │ Cache
/// │
/// │
/// ┌────────▼────────┐
/// │RegionFailoverEnd│
/// └─────────────────┘
@@ -343,7 +373,8 @@ mod tests {
use api::v1::meta::{HeartbeatResponse, MailboxMessage, Peer, RequestHeader};
use catalog::helper::TableGlobalKey;
use common_catalog::consts::{DEFAULT_CATALOG_NAME, DEFAULT_SCHEMA_NAME, MITO_ENGINE};
use common_meta::instruction::{Instruction, InstructionReply, SimpleReply, TableIdent};
use common_meta::ident::TableIdent;
use common_meta::instruction::{Instruction, InstructionReply, SimpleReply};
use common_meta::DatanodeId;
use common_procedure::BoxedProcedure;
use rand::prelude::SliceRandom;
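Review note: the change from a borrowed guard over `Mutex<HashSet<RegionIdent>>` to an owned guard over `RwLock<HashSet<RegionFailoverKey>>` follows a common RAII pattern. The sketch below shows that pattern with standard-library types only. The names `Guard`, `try_start` and `is_running` are hypothetical, and a `u32` stands in for `RegionFailoverKey`.

```rust
use std::collections::HashSet;
use std::sync::{Arc, RwLock};

type Running = Arc<RwLock<HashSet<u32>>>;

/// Owns its key, so it can be moved into a spawned task without a lifetime.
struct Guard {
    running: Running,
    key: u32,
}

impl Drop for Guard {
    fn drop(&mut self) {
        // Deregister the key when the guard goes out of scope.
        self.running.write().unwrap().remove(&self.key);
    }
}

/// Returns a guard only if no entry for `key` was registered before.
fn try_start(running: &Running, key: u32) -> Option<Guard> {
    if running.write().unwrap().insert(key) {
        Some(Guard { running: running.clone(), key })
    } else {
        None
    }
}

/// Read-only check; many readers may hold the read lock concurrently.
fn is_running(running: &Running, key: u32) -> bool {
    running.read().unwrap().contains(&key)
}

fn main() {
    let running: Running = Arc::new(RwLock::new(HashSet::new()));
    let guard = try_start(&running, 7).expect("first start succeeds");
    assert!(try_start(&running, 7).is_none());
    assert!(is_running(&running, 7));
    drop(guard);
    assert!(!is_running(&running, 7));
}
```

Returning the guard from the insert step ties registration and cleanup together, so a caller cannot register a key and forget to remove it.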

View File

@@ -28,6 +28,7 @@ use super::{RegionFailoverContext, State};
use crate::error::{
Error, Result, RetryLaterSnafu, SerializeToJsonSnafu, UnexpectedInstructionReplySnafu,
};
use crate::handler::region_lease_handler::REGION_LEASE_SECONDS;
use crate::handler::HeartbeatMailbox;
use crate::procedure::region_failover::CLOSE_REGION_MESSAGE_TIMEOUT;
use crate::service::mailbox::{Channel, MailboxReceiver};
@@ -35,11 +36,15 @@ use crate::service::mailbox::{Channel, MailboxReceiver};
#[derive(Serialize, Deserialize, Debug)]
pub(super) struct DeactivateRegion {
candidate: Peer,
region_lease_expiry_seconds: u64,
}
impl DeactivateRegion {
pub(super) fn new(candidate: Peer) -> Self {
Self { candidate }
Self {
candidate,
region_lease_expiry_seconds: REGION_LEASE_SECONDS * 2,
}
}
async fn send_close_region_message(
@@ -95,15 +100,21 @@ impl DeactivateRegion {
}
Err(e) if matches!(e, Error::MailboxTimeout { .. }) => {
// Since we are in a region failover situation, the Datanode that the failed region
// resides might be unreachable. So region deactivation is happened in a "try our
// best" effort, do not retry if mailbox received timeout.
// However, if the region failover procedure is also used in a planned maintenance
// situation in the future, a proper retry is a must.
// resides might be unreachable. So we wait for the region lease to expire. The
// region would be closed by its own [RegionAliveKeeper].
self.wait_for_region_lease_expiry().await;
Ok(Box::new(ActivateRegion::new(self.candidate)))
}
Err(e) => Err(e),
}
}
/// Sleep for `region_lease_expiry_seconds` to make sure the region is closed (by its
/// region alive keeper). This is critical to prevent a region from being opened on multiple
/// Datanodes simultaneously.
async fn wait_for_region_lease_expiry(&self) {
tokio::time::sleep(Duration::from_secs(self.region_lease_expiry_seconds)).await;
}
}
#[async_trait]
@@ -120,8 +131,8 @@ impl State for DeactivateRegion {
let mailbox_receiver = match result {
Ok(mailbox_receiver) => mailbox_receiver,
Err(e) if matches!(e, Error::PusherNotFound { .. }) => {
// The Datanode could be unreachable and deregistered from pushers,
// so simply advancing to the next state here.
// See the comments on the mailbox timeout case above.
self.wait_for_region_lease_expiry().await;
return Ok(Box::new(ActivateRegion::new(self.candidate)));
}
Err(e) => return Err(e),
@@ -212,7 +223,10 @@ mod tests {
let mut env = TestingEnvBuilder::new().build().await;
let failed_region = env.failed_region(1).await;
let state = DeactivateRegion::new(Peer::new(2, ""));
let state = DeactivateRegion {
candidate: Peer::new(2, ""),
region_lease_expiry_seconds: 2,
};
let mailbox_receiver = state
.send_close_region_message(&env.context, &failed_region, Duration::from_millis(100))
.await

View File

@@ -14,7 +14,7 @@
use async_trait::async_trait;
use common_error::prelude::{ErrorExt, StatusCode};
use common_meta::instruction::TableIdent;
use common_meta::ident::TableIdent;
use common_meta::peer::Peer;
use common_meta::RegionIdent;
use common_telemetry::info;

View File

@@ -14,7 +14,8 @@
use api::v1::meta::MailboxMessage;
use async_trait::async_trait;
use common_meta::instruction::{Instruction, TableIdent};
use common_meta::ident::TableIdent;
use common_meta::instruction::Instruction;
use common_meta::RegionIdent;
use common_telemetry::info;
use serde::{Deserialize, Serialize};

View File

@@ -20,7 +20,7 @@ use api::v1::meta::{
heartbeat_server, AskLeaderRequest, AskLeaderResponse, HeartbeatRequest, HeartbeatResponse,
Peer, RequestHeader, ResponseHeader, Role,
};
use common_telemetry::{error, info, warn};
use common_telemetry::{debug, error, info, warn};
use futures::StreamExt;
use once_cell::sync::OnceCell;
use tokio::sync::mpsc;
@@ -59,6 +59,7 @@ impl heartbeat_server::Heartbeat for MetaSrv {
break;
}
};
debug!("Receiving heartbeat request: {:?}", req);
if pusher_key.is_none() {
let node_id = get_node_id(header);
@@ -76,6 +77,7 @@ impl heartbeat_server::Heartbeat for MetaSrv {
is_not_leader = res.as_ref().map_or(false, |r| r.is_not_leader());
debug!("Sending heartbeat response: {:?}", res);
tx.send(res).await.expect("working rx");
}
Err(err) => {

View File

@@ -24,6 +24,7 @@ use common_error::prelude::*;
use common_telemetry::{timer, warn};
use etcd_client::{
Client, Compare, CompareOp, DeleteOptions, GetOptions, PutOptions, Txn, TxnOp, TxnOpResponse,
TxnResponse,
};
use crate::error;
@@ -31,6 +32,12 @@ use crate::error::Result;
use crate::metrics::METRIC_META_KV_REQUEST;
use crate::service::store::kv::{KvStore, KvStoreRef};
// Maximum number of operations permitted in a transaction.
// The etcd default configuration's `--max-txn-ops` is 128.
//
// For more detail, see: https://etcd.io/docs/v3.5/op-guide/configuration/
const MAX_TXN_SIZE: usize = 128;
pub struct EtcdStore {
client: Client,
}
@@ -51,6 +58,32 @@ impl EtcdStore {
pub fn with_etcd_client(client: Client) -> Result<KvStoreRef> {
Ok(Arc::new(Self { client }))
}
async fn do_multi_txn(&self, txn_ops: Vec<TxnOp>) -> Result<Vec<TxnResponse>> {
if txn_ops.len() < MAX_TXN_SIZE {
// fast path
let txn = Txn::new().and_then(txn_ops);
let txn_res = self
.client
.kv_client()
.txn(txn)
.await
.context(error::EtcdFailedSnafu)?;
return Ok(vec![txn_res]);
}
let txns = txn_ops
.chunks(MAX_TXN_SIZE)
.map(|part| async move {
let txn = Txn::new().and_then(part);
self.client.kv_client().txn(txn).await
})
.collect::<Vec<_>>();
futures::future::try_join_all(txns)
.await
.context(error::EtcdFailedSnafu)
}
}
#[async_trait::async_trait]
@@ -142,23 +175,19 @@ impl KvStore for EtcdStore {
.into_iter()
.map(|k| TxnOp::get(k, options.clone()))
.collect();
let txn = Txn::new().and_then(get_ops);
let txn_res = self
.client
.kv_client()
.txn(txn)
.await
.context(error::EtcdFailedSnafu)?;
let txn_responses = self.do_multi_txn(get_ops).await?;
let mut kvs = vec![];
for op_res in txn_res.op_responses() {
let get_res = match op_res {
TxnOpResponse::Get(get_res) => get_res,
_ => unreachable!(),
};
for txn_res in txn_responses {
for op_res in txn_res.op_responses() {
let get_res = match op_res {
TxnOpResponse::Get(get_res) => get_res,
_ => unreachable!(),
};
kvs.extend(get_res.kvs().iter().map(KvPair::from_etcd_kv));
kvs.extend(get_res.kvs().iter().map(KvPair::from_etcd_kv));
}
}
let header = Some(ResponseHeader::success(cluster_id));
@@ -185,24 +214,20 @@ impl KvStore for EtcdStore {
.into_iter()
.map(|kv| (TxnOp::put(kv.key, kv.value, options.clone())))
.collect::<Vec<_>>();
let txn = Txn::new().and_then(put_ops);
let txn_res = self
.client
.kv_client()
.txn(txn)
.await
.context(error::EtcdFailedSnafu)?;
let txn_responses = self.do_multi_txn(put_ops).await?;
let mut prev_kvs = vec![];
for op_res in txn_res.op_responses() {
match op_res {
TxnOpResponse::Put(put_res) => {
if let Some(prev_kv) = put_res.prev_key() {
prev_kvs.push(KvPair::from_etcd_kv(prev_kv));
for txn_res in txn_responses {
for op_res in txn_res.op_responses() {
match op_res {
TxnOpResponse::Put(put_res) => {
if let Some(prev_kv) = put_res.prev_key() {
prev_kvs.push(KvPair::from_etcd_kv(prev_kv));
}
}
_ => unreachable!(),
}
_ => unreachable!(), // never get here
}
}
@@ -232,28 +257,23 @@ impl KvStore for EtcdStore {
.into_iter()
.map(|k| TxnOp::delete(k, options.clone()))
.collect::<Vec<_>>();
let txn = Txn::new().and_then(delete_ops);
let txn_res = self
.client
.kv_client()
.txn(txn)
.await
.context(error::EtcdFailedSnafu)?;
let txn_responses = self.do_multi_txn(delete_ops).await?;
for op_res in txn_res.op_responses() {
match op_res {
TxnOpResponse::Delete(delete_res) => {
delete_res.prev_kvs().iter().for_each(|kv| {
prev_kvs.push(KvPair::from_etcd_kv(kv));
});
for txn_res in txn_responses {
for op_res in txn_res.op_responses() {
match op_res {
TxnOpResponse::Delete(delete_res) => {
delete_res.prev_kvs().iter().for_each(|kv| {
prev_kvs.push(KvPair::from_etcd_kv(kv));
});
}
_ => unreachable!(),
}
_ => unreachable!(), // never get here
}
}
let header = Some(ResponseHeader::success(cluster_id));
Ok(BatchDeleteResponse { header, prev_kvs })
}
@@ -308,7 +328,7 @@ impl KvStore for EtcdStore {
let prev_kv = match op_res {
TxnOpResponse::Put(res) => res.prev_key().map(KvPair::from_etcd_kv),
TxnOpResponse::Get(res) => res.kvs().first().map(KvPair::from_etcd_kv),
_ => unreachable!(), // never get here
_ => unreachable!(),
};
let header = Some(ResponseHeader::success(cluster_id));
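Review note: the batching in `do_multi_txn` rests on `slice::chunks`. The following standard-library sketch shows only the splitting step. `split_into_txns` is a hypothetical helper, and plain integers stand in for `TxnOp`.

```rust
/// Same limit as in the diff (the etcd default for `--max-txn-ops`).
const MAX_TXN_SIZE: usize = 128;

/// Hypothetical helper: split operations into batches of at most
/// `MAX_TXN_SIZE`, preserving their order.
fn split_into_txns<T: Clone>(ops: &[T]) -> Vec<Vec<T>> {
    ops.chunks(MAX_TXN_SIZE).map(|part| part.to_vec()).collect()
}

fn main() {
    let ops: Vec<u32> = (0..300).collect();
    let sizes: Vec<usize> = split_into_txns(&ops).iter().map(Vec::len).collect();
    println!("{:?}", sizes);
}
```

Two points are worth keeping in mind when reading the diff. Each batch is submitted as its own etcd transaction, so a large batch operation is no longer atomic as a whole. Also, this sketch returns no batches for empty input, whereas the fast path in the diff still submits a single empty transaction.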

View File

@@ -12,7 +12,7 @@
// See the License for the specific language governing permissions and
// limitations under the License.
use api::v1::meta::{KeyValue, RangeRequest};
use api::v1::meta::{DeleteRangeRequest, KeyValue, RangeRequest};
use crate::error::Result;
use crate::service::store::kv::KvStore;
@@ -24,6 +24,10 @@ pub trait KvStoreExt {
/// Check if a key exists, it does not return the value.
async fn exists(&self, key: Vec<u8>) -> Result<bool>;
/// Delete the value by the given key. If `prev_kv` is true,
/// the previous key-value pair will be returned.
async fn delete(&self, key: Vec<u8>, prev_kv: bool) -> Result<Option<KeyValue>>;
}
#[async_trait::async_trait]
@@ -53,6 +57,18 @@ where
Ok(!kvs.is_empty())
}
async fn delete(&self, key: Vec<u8>, prev_kv: bool) -> Result<Option<KeyValue>> {
let req = DeleteRangeRequest {
key,
prev_kv,
..Default::default()
};
let mut prev_kvs = self.delete_range(req).await?.prev_kvs;
Ok(prev_kvs.pop())
}
}
#[cfg(test)]
@@ -115,6 +131,31 @@ mod tests {
assert!(!in_mem.exists("test_key".as_bytes().to_vec()).await.unwrap());
}
#[tokio::test]
async fn test_delete() {
let mut in_mem = Arc::new(MemStore::new()) as KvStoreRef;
let mut prev_kv = in_mem
.delete("test_key1".as_bytes().to_vec(), true)
.await
.unwrap();
assert!(prev_kv.is_none());
put_stats_to_store(&mut in_mem).await;
assert!(in_mem
.exists("test_key1".as_bytes().to_vec())
.await
.unwrap());
prev_kv = in_mem
.delete("test_key1".as_bytes().to_vec(), true)
.await
.unwrap();
assert!(prev_kv.is_some());
assert_eq!("test_key1".as_bytes(), prev_kv.unwrap().key);
}
async fn put_stats_to_store(store: &mut KvStoreRef) {
store
.put(PutRequest {

View File

@@ -12,13 +12,17 @@
// See the License for the specific language governing permissions and
// limitations under the License.
use std::collections::HashMap;
use api::v1::meta::{PutRequest, TableRouteValue};
use catalog::helper::{TableGlobalKey, TableGlobalValue};
use common_meta::key::TableRouteKey;
use common_meta::rpc::store::{BatchGetRequest, BatchGetResponse};
use snafu::{OptionExt, ResultExt};
use crate::error::{
DecodeTableRouteSnafu, InvalidCatalogValueSnafu, Result, TableRouteNotFoundSnafu,
ConvertProtoDataSnafu, DecodeTableRouteSnafu, InvalidCatalogValueSnafu, Result,
TableRouteNotFoundSnafu,
};
use crate::service::store::ext::KvStoreExt;
use crate::service::store::kv::KvStoreRef;
@@ -27,12 +31,40 @@ pub async fn get_table_global_value(
kv_store: &KvStoreRef,
key: &TableGlobalKey,
) -> Result<Option<TableGlobalValue>> {
let key = key.to_string().into_bytes();
let kv = kv_store.get(key).await?;
let kv = kv_store.get(key.to_raw_key()).await?;
kv.map(|kv| TableGlobalValue::from_bytes(kv.value).context(InvalidCatalogValueSnafu))
.transpose()
}
pub(crate) async fn batch_get_table_global_value(
kv_store: &KvStoreRef,
keys: Vec<&TableGlobalKey>,
) -> Result<HashMap<TableGlobalKey, Option<TableGlobalValue>>> {
let req = BatchGetRequest {
keys: keys.iter().map(|x| x.to_raw_key()).collect::<Vec<_>>(),
};
let mut resp: BatchGetResponse = kv_store
.batch_get(req.into())
.await?
.try_into()
.context(ConvertProtoDataSnafu)?;
let kvs = resp.take_kvs();
let mut result = HashMap::with_capacity(kvs.len());
for kv in kvs {
let key = TableGlobalKey::try_from_raw_key(kv.key()).context(InvalidCatalogValueSnafu)?;
let value = TableGlobalValue::from_bytes(kv.value()).context(InvalidCatalogValueSnafu)?;
result.insert(key, Some(value));
}
for key in keys {
if !result.contains_key(key) {
result.insert(key.clone(), None);
}
}
Ok(result)
}
pub(crate) async fn put_table_global_value(
kv_store: &KvStoreRef,
key: &TableGlobalKey,
@@ -40,7 +72,7 @@ pub(crate) async fn put_table_global_value(
) -> Result<()> {
let req = PutRequest {
header: None,
key: key.to_string().into_bytes(),
key: key.to_raw_key(),
value: value.as_bytes().context(InvalidCatalogValueSnafu)?,
prev_kv: false,
};
@@ -228,12 +260,12 @@ pub(crate) mod tests {
async fn test_put_and_get_table_global_value() {
let kv_store = Arc::new(MemStore::new()) as _;
let key = TableGlobalKey {
let not_exist_key = TableGlobalKey {
catalog_name: "not_exist_catalog".to_string(),
schema_name: "not_exist_schema".to_string(),
table_name: "not_exist_table".to_string(),
};
assert!(get_table_global_value(&kv_store, &key)
assert!(get_table_global_value(&kv_store, &not_exist_key)
.await
.unwrap()
.is_none());
@@ -244,6 +276,12 @@ pub(crate) mod tests {
.unwrap()
.unwrap();
assert_eq!(actual, value);
let keys = vec![&not_exist_key, &key];
let result = batch_get_table_global_value(&kv_store, keys).await.unwrap();
assert_eq!(result.len(), 2);
assert!(result.get(&not_exist_key).unwrap().is_none());
assert_eq!(result.get(&key).unwrap().as_ref().unwrap(), &value);
}
#[tokio::test]
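Review note: `batch_get_table_global_value` returns an entry for every requested key and uses `None` for keys the store did not return. A minimal sketch of that fill-in step, assuming string keys and values in place of `TableGlobalKey` and `TableGlobalValue` (`with_missing_as_none` is a hypothetical name):

```rust
use std::collections::HashMap;

/// Hypothetical helper: map every requested key to `Some(value)` if the
/// store returned it, or to `None` otherwise.
fn with_missing_as_none(
    requested: &[&str],
    found: Vec<(String, String)>,
) -> HashMap<String, Option<String>> {
    let mut result: HashMap<String, Option<String>> =
        found.into_iter().map(|(k, v)| (k, Some(v))).collect();
    for key in requested {
        // Leaves existing entries untouched and inserts `None` for the rest.
        result.entry(key.to_string()).or_insert(None);
    }
    result
}

fn main() {
    let found = vec![("exists".to_string(), "value".to_string())];
    let result = with_missing_as_none(&["missing", "exists"], found);
    println!("{:?}", result.get("missing"));
    println!("{:?}", result.get("exists"));
}
```

Callers can then tell "key was requested but absent" apart from "key was never requested", which is what the `Some(Some(..))` pattern in the region lease handler relies on.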

View File

@@ -42,8 +42,7 @@ use table::engine::{
};
use table::metadata::{TableId, TableInfo, TableVersion};
use table::requests::{
AlterKind, AlterTableRequest, CloseTableRequest, CreateTableRequest, DropTableRequest,
OpenTableRequest,
AlterTableRequest, CloseTableRequest, CreateTableRequest, DropTableRequest, OpenTableRequest,
};
use table::{error as table_error, Result as TableResult, Table, TableRef};
@@ -102,9 +101,8 @@ impl<S: StorageEngine> TableEngine for MitoEngine<S> {
.map_err(BoxedError::new)
.context(table_error::TableOperationSnafu)?;
let table_ref = request.table_ref();
let _lock = self.inner.table_mutex.lock(table_ref.to_string()).await;
if let Some(table) = self.inner.get_mito_table(&table_ref) {
let _lock = self.inner.table_mutex.lock(request.id).await;
if let Some(table) = self.inner.get_mito_table(request.id) {
if request.create_if_not_exists {
return Ok(table);
} else {
@@ -148,26 +146,10 @@ impl<S: StorageEngine> TableEngine for MitoEngine<S> {
) -> TableResult<TableRef> {
let _timer = common_telemetry::timer!(metrics::MITO_ALTER_TABLE_ELAPSED);
if let AlterKind::RenameTable { new_table_name } = &req.alter_kind {
let mut table_ref = req.table_ref();
table_ref.table = new_table_name;
if self.inner.get_mito_table(&table_ref).is_some() {
return TableExistsSnafu {
table_name: table_ref.to_string(),
}
.fail()
.map_err(BoxedError::new)
.context(table_error::TableOperationSnafu)?;
}
}
let mut procedure = AlterMitoTable::new(req, self.inner.clone())
.map_err(BoxedError::new)
.context(table_error::TableOperationSnafu)?;
// TODO(yingwen): Rename has concurrent issue without the procedure runtime. But
// users can't use this method to alter a table so it is still safe. We should
// refactor the table engine to avoid using table name as key.
procedure
.engine_alter_table()
.await
@@ -175,16 +157,12 @@ impl<S: StorageEngine> TableEngine for MitoEngine<S> {
.context(table_error::TableOperationSnafu)
}
fn get_table(
&self,
_ctx: &EngineContext,
table_ref: &TableReference,
) -> TableResult<Option<TableRef>> {
Ok(self.inner.get_table(table_ref))
fn get_table(&self, _ctx: &EngineContext, table_id: TableId) -> TableResult<Option<TableRef>> {
Ok(self.inner.get_table(table_id))
}
fn table_exists(&self, _ctx: &EngineContext, table_ref: &TableReference) -> bool {
self.inner.get_table(table_ref).is_some()
fn table_exists(&self, _ctx: &EngineContext, table_id: TableId) -> bool {
self.inner.get_table(table_id).is_some()
}
async fn drop_table(
@@ -254,16 +232,16 @@ impl<S: StorageEngine> TableEngineProcedure for MitoEngine<S> {
}
pub(crate) struct MitoEngineInner<S: StorageEngine> {
/// All tables opened by the engine. Map key is formatted [TableReference].
/// All tables opened by the engine.
///
/// Writing to `tables` should also hold the `table_mutex`.
tables: DashMap<String, Arc<MitoTable<S::Region>>>,
tables: DashMap<TableId, Arc<MitoTable<S::Region>>>,
object_store: ObjectStore,
compress_type: CompressionType,
storage_engine: S,
/// Table mutex is used to protect the operations such as creating/opening/closing
/// a table, to avoid things like opening the same table simultaneously.
table_mutex: Arc<KeyLock<String>>,
table_mutex: Arc<KeyLock<TableId>>,
}
fn build_row_key_desc(
@@ -429,11 +407,6 @@ impl<S: StorageEngine> MitoEngineInner<S> {
let catalog_name = &request.catalog_name;
let schema_name = &request.schema_name;
let table_name = &request.table_name;
let table_ref = TableReference {
catalog: catalog_name,
schema: schema_name,
table: table_name,
};
let table_id = request.table_id;
let engine_ctx = StorageEngineContext::default();
@@ -452,7 +425,6 @@ impl<S: StorageEngine> MitoEngineInner<S> {
.write_buffer_size
.map(|s| s.0 as usize),
ttl: table_info.meta.options.ttl,
compaction_time_window: table_info.meta.options.compaction_time_window,
};
debug!(
@@ -464,6 +436,11 @@ impl<S: StorageEngine> MitoEngineInner<S> {
let mut regions = HashMap::with_capacity(table_info.meta.region_numbers.len());
let table_ref = TableReference {
catalog: catalog_name,
schema: schema_name,
table: table_name,
};
for region_number in &request.region_numbers {
let region = self
.open_region(&engine_ctx, table_id, *region_number, &table_ref, &opts)
@@ -532,7 +509,6 @@ impl<S: StorageEngine> MitoEngineInner<S> {
.write_buffer_size
.map(|s| s.0 as usize),
ttl: table_info.meta.options.ttl,
compaction_time_window: table_info.meta.options.compaction_time_window,
};
// TODO(weny): Returns an error earlier if the target region does not exist in the meta.
@@ -558,16 +534,7 @@ impl<S: StorageEngine> MitoEngineInner<S> {
ctx: &EngineContext,
request: OpenTableRequest,
) -> TableResult<Option<TableRef>> {
let catalog_name = &request.catalog_name;
let schema_name = &request.schema_name;
let table_name = &request.table_name;
let table_ref = TableReference {
catalog: catalog_name,
schema: schema_name,
table: table_name,
};
if let Some(table) = self.get_table(&table_ref) {
if let Some(table) = self.get_table(request.table_id) {
if let Some(table) = self.check_regions(table, &request.region_numbers)? {
return Ok(Some(table));
}
@@ -575,11 +542,10 @@ impl<S: StorageEngine> MitoEngineInner<S> {
// Acquires the mutex before opening a new table.
let table = {
let table_name_key = table_ref.to_string();
let _lock = self.table_mutex.lock(table_name_key.clone()).await;
let _lock = self.table_mutex.lock(request.table_id).await;
// Checks again, read lock should be enough since we are guarded by the mutex.
if let Some(table) = self.get_mito_table(&table_ref) {
if let Some(table) = self.get_mito_table(request.table_id) {
// Contains all regions or target region
if let Some(table) = self.check_regions(table.clone(), &request.region_numbers)? {
Some(table)
@@ -595,7 +561,7 @@ impl<S: StorageEngine> MitoEngineInner<S> {
let table = self.recover_table(ctx, request.clone()).await?;
if let Some(table) = table {
// already locked
self.tables.insert(table_ref.to_string(), table.clone());
self.tables.insert(request.table_id, table.clone());
Some(table as _)
} else {
@@ -606,8 +572,8 @@ impl<S: StorageEngine> MitoEngineInner<S> {
logging::info!(
"Mito engine opened table: {} in schema: {}",
table_name,
schema_name
request.table_name,
request.schema_name
);
Ok(table)
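
The open path in the hunks above checks the map, takes the per-table lock, checks again, and only then recovers and inserts the table. A minimal sketch of that check-lock-recheck pattern, assuming a blocking std `Mutex` per id instead of the engine's async `KeyLock<TableId>`; `Engine`, `key_locks`, and `recoveries` are hypothetical names used only for illustration:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

// Assumption: a plain integer id.
type TableId = u32;

// Hypothetical engine; a table is represented by an `Arc<String>`.
#[derive(Default)]
struct Engine {
    tables: Mutex<HashMap<TableId, Arc<String>>>,
    // One lock per table id, created on demand.
    key_locks: Mutex<HashMap<TableId, Arc<Mutex<()>>>>,
    // Counts how many times the expensive recovery step actually ran.
    recoveries: AtomicUsize,
}

impl Engine {
    fn get(&self, id: TableId) -> Option<Arc<String>> {
        self.tables.lock().unwrap().get(&id).cloned()
    }

    fn open(&self, id: TableId) -> Arc<String> {
        // Fast path: the table is already open.
        if let Some(table) = self.get(id) {
            return table;
        }
        // Take the lock for this table id only; other ids are not blocked.
        let key_lock = self.key_locks.lock().unwrap().entry(id).or_default().clone();
        let _guard = key_lock.lock().unwrap();
        // Check again: another thread may have opened the table while we waited.
        if let Some(table) = self.get(id) {
            return table;
        }
        // Stand-in for recovering the table from storage.
        self.recoveries.fetch_add(1, Ordering::SeqCst);
        let table = Arc::new(format!("table-{id}"));
        self.tables.lock().unwrap().insert(id, table.clone());
        table
    }
}

fn main() {
    let engine = Arc::new(Engine::default());
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let engine = Arc::clone(&engine);
            std::thread::spawn(move || engine.open(42))
        })
        .collect();
    for handle in handles {
        assert_eq!(*handle.join().unwrap(), "table-42");
    }
    // Recovery ran once even though four threads raced to open the table.
    assert_eq!(engine.recoveries.load(Ordering::SeqCst), 1);
}
```

Keying the lock by id keeps the lock identity stable across renames, which a name-derived key would not.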
@@ -615,10 +581,8 @@ impl<S: StorageEngine> MitoEngineInner<S> {
async fn drop_table(&self, request: DropTableRequest) -> TableResult<bool> {
// Remove the table from the engine to avoid further access from users.
let table_ref = request.table_ref();
let _lock = self.table_mutex.lock(table_ref.to_string()).await;
let removed_table = self.tables.remove(&table_ref.to_string());
let _lock = self.table_mutex.lock(request.table_id).await;
let removed_table = self.tables.remove(&request.table_id);
// Close the table to close all regions. Closing a region is idempotent.
if let Some((_, table)) = &removed_table {
let regions = table.region_ids();
@@ -665,17 +629,13 @@ impl<S: StorageEngine> MitoEngineInner<S> {
Ok(Some((manifest, table_info)))
}
fn get_table(&self, table_ref: &TableReference) -> Option<TableRef> {
self.tables
.get(&table_ref.to_string())
.map(|en| en.value().clone() as _)
fn get_table(&self, table_id: TableId) -> Option<TableRef> {
self.tables.get(&table_id).map(|en| en.value().clone() as _)
}
/// Returns the [MitoTable].
fn get_mito_table(&self, table_ref: &TableReference) -> Option<Arc<MitoTable<S::Region>>> {
self.tables
.get(&table_ref.to_string())
.map(|en| en.value().clone())
fn get_mito_table(&self, table_id: TableId) -> Option<Arc<MitoTable<S::Region>>> {
self.tables.get(&table_id).map(|en| en.value().clone())
}
async fn close(&self) -> TableResult<()> {
@@ -698,8 +658,7 @@ impl<S: StorageEngine> MitoEngineInner<S> {
}
async fn close_table(&self, request: CloseTableRequest) -> TableResult<CloseTableResult> {
let table_ref = request.table_ref();
if let Some(table) = self.get_mito_table(&table_ref) {
if let Some(table) = self.get_mito_table(request.table_id) {
return self
.close_table_inner(table, Some(&request.region_numbers), request.flush)
.await;
@@ -715,13 +674,8 @@ impl<S: StorageEngine> MitoEngineInner<S> {
flush: bool,
) -> TableResult<CloseTableResult> {
let info = table.table_info();
let table_ref = TableReference {
catalog: &info.catalog_name,
schema: &info.schema_name,
table: &info.name,
};
let table_id = info.ident.table_id;
let _lock = self.table_mutex.lock(table_ref.to_string()).await;
let _lock = self.table_mutex.lock(table_id).await;
let all_regions = table.region_ids();
let regions = regions.unwrap_or(&all_regions);
@@ -740,12 +694,12 @@ impl<S: StorageEngine> MitoEngineInner<S> {
}
if table.is_releasable() {
self.tables.remove(&table_ref.to_string());
self.tables.remove(&table_id);
logging::info!(
"Mito engine closed table: {} in schema: {}",
table_ref.table,
table_ref.schema,
info.name,
info.schema_name,
);
return Ok(CloseTableResult::Released(removed_regions));
}

Some files were not shown because too many files have changed in this diff