chore: cherry pick fixes and bump version to v1.0.1 (#8024)

* fix: remap peer addresses during retries (#7933)

* fix: remap peer addresses during retries

Signed-off-by: WenyXu <wenymedia@gmail.com>

* chore: styling

Signed-off-by: WenyXu <wenymedia@gmail.com>

* test: add tests

Signed-off-by: WenyXu <wenymedia@gmail.com>

* chore: apply suggestions from CR

Signed-off-by: WenyXu <wenymedia@gmail.com>

---------

Signed-off-by: WenyXu <wenymedia@gmail.com>

* fix: using uint64 datatype for postgres prepared statement parameters (#7942)

* feat: add support for decimal parameter type, remove string replacement fallback

* chore: format

* fix: add support for using unsigned bigint in postgres

* chore: format toml

* refactor: cleanup duplicated code

* fix: rescale decimal

Signed-off-by: WenyXu <wenymedia@gmail.com>

* fix: fix current version comparison logic for pre-releases (#7946)

Signed-off-by: liyang <daviderli614@gmail.com>
Signed-off-by: WenyXu <wenymedia@gmail.com>

* fix(index): intersect bitmaps before early exit in predicates applier (#7867)

* fix(index): intersect bitmaps before early exit in predicates applier

The loop skipped intersecting when the next bitmap was empty, which left
the accumulator unchanged instead of zeroing it. Intersect first, then
break when the result is empty.
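
The ordering described above can be sketched as follows. This is a minimal illustration only: bitmaps are modeled as `BTreeSet<u32>` row-id sets, and `intersect_all` is a hypothetical name, not the actual API of the predicates applier.

```rust
use std::collections::BTreeSet;

/// Intersects every bitmap into an accumulator and stops once it is empty.
/// Hypothetical helper: bitmaps are modeled as sets of row ids.
fn intersect_all(bitmaps: &[BTreeSet<u32>]) -> BTreeSet<u32> {
    let mut iter = bitmaps.iter();
    // In this sketch, no bitmaps means no matches.
    let mut acc = match iter.next() {
        Some(first) => first.clone(),
        None => return BTreeSet::new(),
    };
    for bitmap in iter {
        // Intersect first, so an empty `bitmap` zeroes the accumulator.
        acc = acc.intersection(bitmap).copied().collect();
        // Only then take the early exit.
        if acc.is_empty() {
            break;
        }
    }
    acc
}

fn main() {
    let a = BTreeSet::from([1u32, 2, 3]);
    let empty: BTreeSet<u32> = BTreeSet::new();
    // An empty bitmap must clear the result instead of being skipped.
    println!("{:?}", intersect_all(&[a, empty]));
}
```

Checking for emptiness before intersecting would leave `acc` holding the rows of the earlier bitmaps, which is the behavior the fix removes.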

Signed-off-by: Weixie Cui <cuiweixie@gmail.com>

* per gemini

* style(index): format predicates applier loop

* fix(index): remove unused mut in predicates applier

---------

Signed-off-by: Weixie Cui <cuiweixie@gmail.com>
Co-authored-by: discord9 <55937128+discord9@users.noreply.github.com>
Co-authored-by: discord9 <discord9@163.com>
Signed-off-by: WenyXu <wenymedia@gmail.com>

* fix: randomize standalone test ports in cli export test (#7955)

fix/flaky-test:
 ### Add Dynamic Port Selection for Standalone Tests

 - **`cli.rs`**: Implemented functions `random_standalone_addrs` and `choose_random_unused_port_offset` to dynamically select unused ports for standalone tests, enhancing test reliability.
 - Updated `test_export_create_table_with_quoted_names` to use dynamically assigned ports for HTTP, RPC, MySQL, and PostgreSQL addresses.
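
One common way to obtain an unused port for a test is to let the OS pick one by binding to port 0. The sketch below shows that technique; it is not the port-offset approach of `choose_random_unused_port_offset`, whose implementation is not shown here, and `pick_unused_port` is a hypothetical name.

```rust
use std::net::TcpListener;

/// Asks the OS for a currently unused TCP port on localhost.
/// Hypothetical helper, not the function added in `cli.rs`.
fn pick_unused_port() -> std::io::Result<u16> {
    // Binding to port 0 lets the OS choose a free ephemeral port.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let port = listener.local_addr()?.port();
    // The listener is dropped on return, releasing the port for the test server.
    Ok(port)
}

fn main() {
    let http = pick_unused_port().expect("no free port");
    let rpc = pick_unused_port().expect("no free port");
    println!("http=127.0.0.1:{} rpc=127.0.0.1:{}", http, rpc);
}
```

Another process can still claim the port between the probe and the actual bind, so this reduces collisions between concurrent tests rather than ruling them out.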

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
Signed-off-by: WenyXu <wenymedia@gmail.com>

* chore: fix git cliff errors in latest version (#7947)

* chore: fix git cliff errors in latest version

- Fix errors in v2.12.0
- Do not generate logs for beta/rc tags between the compared commits

Signed-off-by: evenyag <realevenyag@gmail.com>

* chore: preserve blank line before release date in changelog

Signed-off-by: evenyag <realevenyag@gmail.com>

---------

Signed-off-by: evenyag <realevenyag@gmail.com>
Signed-off-by: WenyXu <wenymedia@gmail.com>

* fix: match term zh (#7952)

* fix: match term zh

Signed-off-by: discord9 <discord9@163.com>

* chore: per gemini

Signed-off-by: discord9 <discord9@163.com>

* chore: revert accidental change

Signed-off-by: discord9 <discord9@163.com>

* feat: unicode script han

Signed-off-by: discord9 <discord9@163.com>

---------

Signed-off-by: discord9 <discord9@163.com>
Signed-off-by: WenyXu <wenymedia@gmail.com>

* ci: set upload timeout for uploading artifacts to S3 (#7958)

* ci: set upload timeout for uploading artifacts to S3

Signed-off-by: liyang <daviderli614@gmail.com>

* Update upload-artifacts-to-s3.sh

---------

Signed-off-by: liyang <daviderli614@gmail.com>
Signed-off-by: WenyXu <wenymedia@gmail.com>

* fix: cargo check -p common-meta (#7964)

fix: moka feature
Signed-off-by: WenyXu <wenymedia@gmail.com>

* fix: always skip field pruning when using merge mode (#7957)

* test: add prefilter regressions for last_row null filters

Signed-off-by: evenyag <realevenyag@gmail.com>

* fix: skip fields in all merge mode

Signed-off-by: evenyag <realevenyag@gmail.com>

* refactor: simplify pre-filter skip fields handling

Signed-off-by: evenyag <realevenyag@gmail.com>

* test: update test

Signed-off-by: evenyag <realevenyag@gmail.com>

---------

Signed-off-by: evenyag <realevenyag@gmail.com>
Signed-off-by: WenyXu <wenymedia@gmail.com>

* fix: mysql prepare correctly returns error instead of panic (#7963)

feat: mysql writer support multiple statement execution

Signed-off-by: luofucong <luofc@foxmail.com>
Signed-off-by: WenyXu <wenymedia@gmail.com>

* fix: relax azblob validation requirements (#7970)

Signed-off-by: WenyXu <wenymedia@gmail.com>

* feat(mito2): allow CompactionOutput to succeed independently (#7948)

* refactor(mito2): improve compaction error handling and file removal

Refactor compaction task execution to enhance error handling and robustness.
- Implemented parallel execution of compaction tasks with proper error capture and logging for individual task failures.
- Ensured JoinSnafu is no longer directly used in error propagation, instead handling errors within the task processing loop.
- Adjusted file removal logic to correctly include expired SSTs after compaction merges.

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* refactor(mito2): extract SstMerger trait for testability in compaction

Extract SstMerger trait and DefaultSstMerger implementation to improve the testability of DefaultCompactor.

The DefaultCompactor is now generic over SstMerger, allowing mock implementations to be injected for unit testing without relying on the full object storage access layer. This refactoring separates the concerns of SST file merging from the overall compaction orchestration logic.

Additionally:
- Updated CompactionScheduler to use DefaultCompactor::default().
- Added unit tests for DefaultCompactor using a MockMerger.
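
The testability gain from making the compactor generic over a merger trait can be sketched as below. The trait signature, the `Compactor` type, and the mock behavior are all hypothetical simplifications; the real `SstMerger` and `DefaultCompactor` signatures are not shown in this message.

```rust
/// Hypothetical, simplified merger trait: merges input file ids into one output id.
trait SstMerger {
    fn merge(&self, inputs: &[u64]) -> Result<u64, String>;
}

/// Orchestration logic that is generic over the merger, so tests can inject a mock.
struct Compactor<M: SstMerger> {
    merger: M,
}

impl<M: SstMerger> Compactor<M> {
    /// Merges each group independently and reports one result per group.
    fn compact(&self, groups: &[Vec<u64>]) -> Vec<Result<u64, String>> {
        groups.iter().map(|group| self.merger.merge(group)).collect()
    }
}

/// Mock that needs no object storage: fails on empty input, otherwise sums the ids.
struct MockMerger;

impl SstMerger for MockMerger {
    fn merge(&self, inputs: &[u64]) -> Result<u64, String> {
        if inputs.is_empty() {
            Err("no inputs".to_string())
        } else {
            Ok(inputs.iter().sum())
        }
    }
}

fn main() {
    let compactor = Compactor { merger: MockMerger };
    println!("{:?}", compactor.compact(&[vec![1, 2], vec![]]));
}
```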

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* fix(compaction): propagate join error during sst flush

Correctly propagates the error when joining SST flush handles during compaction. Previously, the error was logged but not returned, leading to potential silent failures.
Also reorders some imports for consistency.
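
The difference between logging a join error and returning it can be sketched as follows. The example assumes OS threads as a stand-in for the async flush handles; `flush_all` is a hypothetical name.

```rust
use std::thread;

/// Joins every flush handle and returns the first join failure to the caller.
/// Hypothetical helper: each handle yields the number of bytes it flushed.
fn flush_all(handles: Vec<thread::JoinHandle<u64>>) -> Result<u64, String> {
    let mut total = 0;
    for handle in handles {
        // Propagate the join error with `?` instead of only logging it.
        total += handle
            .join()
            .map_err(|_| "flush task panicked".to_string())?;
    }
    Ok(total)
}

fn main() {
    let handles = vec![thread::spawn(|| 10u64), thread::spawn(|| 5u64)];
    println!("{:?}", flush_all(handles));
}
```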

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* perf(compaction): pre-allocate capacity for compacted_inputs

Pre-allocates capacity for the compacted_inputs vector based on the estimated total size of inputs and expired SSTs. This optimization aims to reduce vector reallocations during the compaction process.
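
A minimal sketch of the pre-allocation pattern, assuming file handles are plain `u64` ids. `collect_compacted_inputs` is a hypothetical name and the sizing here uses element counts.

```rust
/// Collects merged inputs and expired files into one vector without reallocating.
/// Hypothetical helper: files are modeled as `u64` ids.
fn collect_compacted_inputs(inputs: &[Vec<u64>], expired: &[u64]) -> Vec<u64> {
    // Compute the final length up front so the vector is allocated once.
    let total = inputs.iter().map(Vec::len).sum::<usize>() + expired.len();
    let mut compacted = Vec::with_capacity(total);
    for group in inputs {
        compacted.extend_from_slice(group);
    }
    compacted.extend_from_slice(expired);
    compacted
}

fn main() {
    let compacted = collect_compacted_inputs(&[vec![1, 2], vec![3]], &[9]);
    println!("{:?} (capacity {})", compacted, compacted.capacity());
}
```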

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* feat/allow-partial-compaction:
 ### Commit Message

 Enhance `DefaultCompactor` and `MockMerger` for Improved Flexibility

 - **`compactor.rs`**:
   - Added `Clone` trait to `DefaultSstMerger` and `MockMerger` to allow cloning.
   - Removed `Arc` wrapping from `DefaultCompactor`'s `merger` field for direct usage.
   - Updated `merge_ssts` method to require `Clone` trait for `SstMerger`.
   - Modified `MockMerger` to use `Arc<Mutex>` for `results` and `call_idx` to ensure thread safety.
   - Adjusted error handling to use `error::InvalidMetaSnafu` directly.

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

---------

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
Signed-off-by: WenyXu <wenymedia@gmail.com>

* feat: propagate staging leader through lease and heartbeat (#7950)

* feat(mito): expose staging leader role state

* fix(region): clear staging metadata on leader exit

* feat: propagate staging leader role through heartbeat and metasrv

* chore: update comments

Signed-off-by: WenyXu <wenymedia@gmail.com>

* fix(region): unify staging exit role transitions

* chore: update proto

Signed-off-by: WenyXu <wenymedia@gmail.com>

---------

Signed-off-by: WenyXu <wenymedia@gmail.com>

* feat: cancel local compaction for enter staging (#7885)

* feat(mito2): support cancelling active local compaction

Signed-off-by: WenyXu <wenymedia@gmail.com>

* chore: apply suggestions from CR

Signed-off-by: WenyXu <wenymedia@gmail.com>

* test(mito2): cover compaction cancellation return paths

Signed-off-by: WenyXu <wenymedia@gmail.com>

* chore: apply suggestions from CR

Signed-off-by: WenyXu <wenymedia@gmail.com>

* chore: apply suggestions from CR

Signed-off-by: WenyXu <wenymedia@gmail.com>

* chore: apply suggestions from CR

Signed-off-by: WenyXu <wenymedia@gmail.com>

* fix: cancel remaining tasks

Signed-off-by: WenyXu <wenymedia@gmail.com>

* chore: apply suggestions

Signed-off-by: WenyXu <wenymedia@gmail.com>

---------

Signed-off-by: WenyXu <wenymedia@gmail.com>

* refactor: move group rollback ownership to parent repartition (#7967)

* refactor(meta-srv): move group rollback ownership to parent repartition procedure

- Parent procedure now owns partial rollback based on failed/unknown subprocedures
- rollback order: group metadata first, then allocated-region cleanup
- original_target_routes captured during build-plan, persisted in RepartitionPlanEntry
- rollback_group_metadata_routes moved to utils as parent-owned helper
- Group subprocedure no longer supports rollback (rollback_supported = false)
- Removed UpdateMetadata::RollbackStaging from group state machine
- Deleted redundant group rollback tests and helpers

BREAKING CHANGE: group Procedure no longer handles rollback; parent procedure
is responsible for crash recovery and selecting which plans to roll back.
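
The parent-owned rollback described above can be sketched roughly as follows. This is a loose illustration: the status enum, the step strings, and `plan_rollback` are hypothetical, and applying cleanup per plan is an assumption rather than something stated in this message.

```rust
/// Hypothetical outcome of a group subprocedure as seen by the parent.
#[derive(Clone, Copy, Debug, PartialEq)]
enum SubprocedureStatus {
    Succeeded,
    Failed,
    Unknown,
}

/// Lists rollback steps for failed or unknown plans: metadata first, then cleanup.
fn plan_rollback(statuses: &[SubprocedureStatus]) -> Vec<String> {
    // Only plans that did not clearly succeed are selected for rollback.
    let targets: Vec<usize> = statuses
        .iter()
        .enumerate()
        .filter(|(_, status)| **status != SubprocedureStatus::Succeeded)
        .map(|(index, _)| index)
        .collect();

    let mut steps = Vec::new();
    for index in &targets {
        steps.push(format!("rollback group metadata for plan {}", index));
    }
    for index in &targets {
        steps.push(format!("clean up allocated regions for plan {}", index));
    }
    steps
}

fn main() {
    let statuses = [
        SubprocedureStatus::Succeeded,
        SubprocedureStatus::Failed,
        SubprocedureStatus::Unknown,
    ];
    for step in plan_rollback(&statuses) {
        println!("{}", step);
    }
}
```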

Signed-off-by: WenyXu <wenymedia@gmail.com>

* chore: update comments

Signed-off-by: WenyXu <wenymedia@gmail.com>

* chore: apply suggestions from CR

Signed-off-by: WenyXu <wenymedia@gmail.com>

---------

Signed-off-by: WenyXu <wenymedia@gmail.com>

* feat: use PreFilterMode::All if only one source in the partition range (#7973)

* feat: use PrefilterMode::All if only one source in the partition range

Signed-off-by: evenyag <realevenyag@gmail.com>

* fix: consider append_mode

Signed-off-by: evenyag <realevenyag@gmail.com>

* chore: skip merge if only one source

Signed-off-by: evenyag <realevenyag@gmail.com>

* test: fix test

Signed-off-by: evenyag <realevenyag@gmail.com>

---------

Signed-off-by: evenyag <realevenyag@gmail.com>
Signed-off-by: WenyXu <wenymedia@gmail.com>

* fix(meta): renew operating region leases from keeper roles (#7971)

* refactor(meta): store operating region roles in memory keeper

Signed-off-by: WenyXu <wenymedia@gmail.com>

* refactor(meta): register operating region roles from region routes

Signed-off-by: WenyXu <wenymedia@gmail.com>

* refactor(meta): require explicit operating region roles

Signed-off-by: WenyXu <wenymedia@gmail.com>

* fix(meta): renew operating region leases from keeper roles

Signed-off-by: WenyXu <wenymedia@gmail.com>

* test(common-meta): cover region route role helpers

Signed-off-by: WenyXu <wenymedia@gmail.com>

* test(meta): cover operating region role propagation

Signed-off-by: WenyXu <wenymedia@gmail.com>

* chore: apply suggestions

Signed-off-by: WenyXu <wenymedia@gmail.com>

* chore: apply suggestions

Signed-off-by: WenyXu <wenymedia@gmail.com>

* fix(meta): prefer metadata roles for region lease renewal

Signed-off-by: WenyXu <wenymedia@gmail.com>

---------

Signed-off-by: WenyXu <wenymedia@gmail.com>

* feat: add an index page (#7975)

* feat: include an index page

* fix: address code review

* fix: let / auth gated

* refactor: rename public-apis to public-api-prefix

Signed-off-by: WenyXu <wenymedia@gmail.com>

* fix: remove redundant error messages in admin functions (#7953)

Closes #7938

Signed-off-by: yxrxy <yxrxytrigger@gmail.com>
Signed-off-by: WenyXu <wenymedia@gmail.com>

* perf: better jieba cut (#7984)

* perf: better jieba cut

Signed-off-by: discord9 <discord9@163.com>

* fix: also filter pun mark


Signed-off-by: discord9 <discord9@163.com>

* chore

Signed-off-by: discord9 <discord9@163.com>

* docs: explain why

Signed-off-by: discord9 <discord9@163.com>

---------

Signed-off-by: discord9 <discord9@163.com>
Signed-off-by: WenyXu <wenymedia@gmail.com>

* fix: allow ipv4_num_to_string to accept valid integers (#7994)

* fix: allow ipv4_num_to_string to accept valid integers

Signed-off-by: Johannes Sluis <joesluis51@gmail.com>

* test: update sqlness result file

Signed-off-by: Johannes Sluis <joesluis51@gmail.com>

* fix: use coercible integer signature for ipv4_num_to_string

Signed-off-by: Johannes Sluis <joesluis51@gmail.com>

---------

Signed-off-by: Johannes Sluis <joesluis51@gmail.com>
Signed-off-by: WenyXu <wenymedia@gmail.com>

* fix: update manifest state before deleting delta files (#8001)

* fix: update state before deleting deltas

Signed-off-by: evenyag <realevenyag@gmail.com>

* chore: update comment

Signed-off-by: evenyag <realevenyag@gmail.com>

* chore: update log level

Signed-off-by: evenyag <realevenyag@gmail.com>

---------

Signed-off-by: evenyag <realevenyag@gmail.com>
Signed-off-by: WenyXu <wenymedia@gmail.com>

* fix: upgrade mysql metadata value limit to mediumblob (#7985)

* fix: upgrade mysql metadata values to mediumblob

* fix: fail mysql metadata startup on upgrade check errors

Signed-off-by: WenyXu <wenymedia@gmail.com>

---------

Signed-off-by: WenyXu <wenymedia@gmail.com>

* fix: zh same underscore behavior (#8002)

* fix: zh same underscore behavior

Signed-off-by: discord9 <discord9@163.com>

* fix: only add token with _ from en analyzer

Signed-off-by: discord9 <discord9@163.com>

* test: neg sqlness case

Signed-off-by: discord9 <discord9@163.com>

---------

Signed-off-by: discord9 <discord9@163.com>
Signed-off-by: WenyXu <wenymedia@gmail.com>

* fix: manifest recovery scans after last version if possible (#8009)

* feat: support scan with start after

Signed-off-by: evenyag <realevenyag@gmail.com>

* test: add start_after test

Signed-off-by: evenyag <realevenyag@gmail.com>

* chore: adjust remove dir warning

Signed-off-by: evenyag <realevenyag@gmail.com>

* test: test list_with_start_after

Signed-off-by: evenyag <realevenyag@gmail.com>

* fix: update get_paths call with start_after arg in checkpoint test

Signed-off-by: evenyag <realevenyag@gmail.com>

* feat: log scan metrics

Signed-off-by: evenyag <realevenyag@gmail.com>

* fix: fix start_after on manifest dir

Signed-off-by: evenyag <realevenyag@gmail.com>

---------

Signed-off-by: evenyag <realevenyag@gmail.com>
Signed-off-by: WenyXu <wenymedia@gmail.com>

* chore: add a standalone flag in plugins during startup (#7974)

* chore: add a standalone flag in plugins during startup

Signed-off-by: shuiyisong <xixing.sys@gmail.com>

* chore: add derive

Signed-off-by: shuiyisong <xixing.sys@gmail.com>

---------

Signed-off-by: shuiyisong <xixing.sys@gmail.com>
Signed-off-by: WenyXu <wenymedia@gmail.com>

* chore: bump version to v1.0.1

Signed-off-by: WenyXu <wenymedia@gmail.com>

---------

Signed-off-by: WenyXu <wenymedia@gmail.com>
Signed-off-by: liyang <daviderli614@gmail.com>
Signed-off-by: Weixie Cui <cuiweixie@gmail.com>
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
Signed-off-by: evenyag <realevenyag@gmail.com>
Signed-off-by: discord9 <discord9@163.com>
Signed-off-by: luofucong <luofc@foxmail.com>
Signed-off-by: yxrxy <yxrxytrigger@gmail.com>
Signed-off-by: Johannes Sluis <joesluis51@gmail.com>
Signed-off-by: shuiyisong <xixing.sys@gmail.com>
Co-authored-by: Ning Sun <sunng@protonmail.com>
Co-authored-by: liyang <daviderli614@gmail.com>
Co-authored-by: cui <cuiweixie@gmail.com>
Co-authored-by: discord9 <55937128+discord9@users.noreply.github.com>
Co-authored-by: discord9 <discord9@163.com>
Co-authored-by: Lei, HUANG <6406592+v0y4g3r@users.noreply.github.com>
Co-authored-by: Yingwen <realevenyag@gmail.com>
Co-authored-by: fys <40801205+fengys1996@users.noreply.github.com>
Co-authored-by: LFC <990479+MichaelScofield@users.noreply.github.com>
Co-authored-by: yxrxy <1532529704@qq.com>
Co-authored-by: Joe Sluis <43276756+JoeS51@users.noreply.github.com>
Co-authored-by: shuiyisong <113876041+shuiyisong@users.noreply.github.com>
This commit is contained in:
Weny Xu
2026-04-23 17:37:27 +08:00
committed by GitHub
parent f3dbf34c74
commit 8d2f92c01a
125 changed files with 6323 additions and 1711 deletions


@@ -30,13 +30,72 @@ CLEAN_LATEST=$(echo "$LATEST_VERSION" | sed 's/^v//' | sed 's/-nightly-.*//')
echo "Current version: $CLEAN_CURRENT"
echo "Latest release version: $CLEAN_LATEST"
# Use sort -V to compare versions
HIGHER_VERSION=$(printf "%s\n%s" "$CLEAN_CURRENT" "$CLEAN_LATEST" | sort -V | tail -n1)
# Function to extract base version (without pre-release suffix)
get_base_version() {
echo "$1" | sed -E 's/-(alpha|beta|rc|pre).*//'
}
if [ "$HIGHER_VERSION" = "$CLEAN_CURRENT" ]; then
# Function to check if a version is pre-release
is_prerelease() {
[[ "$1" =~ -(alpha|beta|rc|pre) ]]
}
# Compare versions properly considering pre-release
compare_versions() {
local current=$1
local latest=$2
# Extract base versions
local current_base=$(get_base_version "$current")
local latest_base=$(get_base_version "$latest")
# Compare base versions first
HIGHER_BASE=$(printf "%s\n%s" "$current_base" "$latest_base" | sort -V | tail -n1)
if [ "$HIGHER_BASE" = "$latest_base" ] && [ "$current_base" != "$latest_base" ]; then
# Latest has higher base version
echo "current_older"
return
elif [ "$HIGHER_BASE" = "$current_base" ] && [ "$current_base" != "$latest_base" ]; then
# Current has higher base version
echo "current_newer"
return
fi
# Base versions are equal, compare pre-release status
if [ "$current_base" = "$latest_base" ]; then
# If current is pre-release and latest is not, current is older
if is_prerelease "$current" && ! is_prerelease "$latest"; then
echo "current_older"
return
fi
# If latest is pre-release and current is not, current is newer
if ! is_prerelease "$current" && is_prerelease "$latest"; then
echo "current_newer"
return
fi
fi
# Both are same type or different base versions already handled, use sort -V
HIGHER_VERSION=$(printf "%s\n%s" "$current" "$latest" | sort -V | tail -n1)
if [ "$HIGHER_VERSION" = "$current" ]; then
echo "current_newer_or_equal"
else
echo "current_older"
fi
}
RESULT=$(compare_versions "$CLEAN_CURRENT" "$CLEAN_LATEST")
if [ "$RESULT" = "current_newer" ] || [ "$RESULT" = "current_newer_or_equal" ]; then
echo "Current version ($CLEAN_CURRENT) is NEWER than or EQUAL to latest ($CLEAN_LATEST)"
echo "is-current-version-latest=true" >> $GITHUB_OUTPUT
if [ -n "$GITHUB_OUTPUT" ]; then
echo "is-current-version-latest=true" >> $GITHUB_OUTPUT
fi
else
echo "Current version ($CLEAN_CURRENT) is OLDER than latest ($CLEAN_LATEST)"
echo "is-current-version-latest=false" >> $GITHUB_OUTPUT
if [ -n "$GITHUB_OUTPUT" ]; then
echo "is-current-version-latest=false" >> $GITHUB_OUTPUT
fi
fi
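
The comparison rules in the hunk above can be restated as a small function. This is a sketch under stated assumptions, not a port of the script: the names are hypothetical, and the final tie-break uses plain string ordering where the script uses `sort -V`.

```rust
use std::cmp::Ordering;

/// Pre-release markers recognized by the script's regex.
const MARKERS: [&str; 4] = ["-alpha", "-beta", "-rc", "-pre"];

/// Returns the base version and whether a pre-release marker is present.
fn split_version(version: &str) -> (&str, bool) {
    let cut = MARKERS.iter().filter_map(|m| version.find(*m)).min();
    match cut {
        Some(index) => (&version[..index], true),
        None => (version, false),
    }
}

/// Parses "1.0.1" into [1, 0, 1]; non-numeric parts count as 0 in this sketch.
fn numeric(base: &str) -> Vec<u64> {
    base.split('.').map(|part| part.parse().unwrap_or(0)).collect()
}

/// Compares base versions first, then treats a pre-release as older than its release.
fn compare(current: &str, latest: &str) -> Ordering {
    let (current_base, current_pre) = split_version(current);
    let (latest_base, latest_pre) = split_version(latest);
    numeric(current_base)
        .cmp(&numeric(latest_base))
        .then_with(|| match (current_pre, latest_pre) {
            (true, false) => Ordering::Less,
            (false, true) => Ordering::Greater,
            // Simplification: the script falls back to `sort -V` here.
            _ => current.cmp(latest),
        })
}

fn main() {
    println!("{:?}", compare("1.0.1-rc.1", "1.0.1"));
    println!("{:?}", compare("1.0.1", "1.0.0"));
}
```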


@@ -38,6 +38,11 @@ function upload_artifacts() {
curl -X PUT \
-u "$PROXY_USERNAME:$PROXY_PASSWORD" \
-F "file=@$file" \
--max-time 3600 \
--connect-timeout 20 \
--retry 5 \
--retry-delay 10 \
--retry-max-time 3000 \
"$TARGET_URL"
done
}
@@ -54,6 +59,11 @@ function update_version_info() {
curl -X PUT \
-u "$PROXY_USERNAME:$PROXY_PASSWORD" \
-F "file=@latest-version.txt" \
--max-time 3600 \
--connect-timeout 20 \
--retry 5 \
--retry-delay 10 \
--retry-max-time 3000 \
"$TARGET_URL"
fi
@@ -66,6 +76,11 @@ function update_version_info() {
curl -X PUT \
-u "$PROXY_USERNAME:$PROXY_PASSWORD" \
-F "file=@latest-nightly-version.txt" \
--max-time 3600 \
--connect-timeout 20 \
--retry 5 \
--retry-delay 10 \
--retry-max-time 3000 \
"$TARGET_URL"
fi
fi

162
Cargo.lock generated

@@ -212,7 +212,7 @@ checksum = "d301b3b94cb4b2f23d7917810addbbaff90738e0ca2be692bd027e70d7e0330c"
[[package]]
name = "api"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"arrow-schema 57.3.0",
"common-base",
@@ -933,7 +933,7 @@ checksum = "1505bd5d3d116872e7271a6d4e16d81d0c8570876c8de68093a09ac269d8aac0"
[[package]]
name = "auth"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"api",
"async-trait",
@@ -1523,7 +1523,7 @@ dependencies = [
[[package]]
name = "cache"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"catalog",
"common-error",
@@ -1559,7 +1559,7 @@ checksum = "37b2a672a2cb129a2e41c10b1224bb368f9f37a2b16b612598138befd7b37eb5"
[[package]]
name = "catalog"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"api",
"arrow 57.3.0",
@@ -1894,7 +1894,7 @@ checksum = "b94f61472cee1439c0b966b47e3aca9ae07e45d070759512cd390ea2bebc6675"
[[package]]
name = "cli"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"async-stream",
"async-trait",
@@ -1951,7 +1951,7 @@ dependencies = [
[[package]]
name = "client"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"api",
"arc-swap",
@@ -1983,7 +1983,7 @@ dependencies = [
"serde_json",
"snafu 0.8.6",
"store-api",
"substrait 1.0.0",
"substrait 1.0.1",
"tokio",
"tokio-stream",
"tonic 0.14.2",
@@ -2023,7 +2023,7 @@ dependencies = [
[[package]]
name = "cmd"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"api",
"async-trait",
@@ -2155,7 +2155,7 @@ dependencies = [
[[package]]
name = "common-base"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"ahash 0.8.12",
"anymap2",
@@ -2175,14 +2175,14 @@ dependencies = [
[[package]]
name = "common-catalog"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"const_format",
]
[[package]]
name = "common-config"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"common-base",
"common-error",
@@ -2206,7 +2206,7 @@ dependencies = [
[[package]]
name = "common-datasource"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"arrow 57.3.0",
"arrow-schema 57.3.0",
@@ -2242,7 +2242,7 @@ dependencies = [
[[package]]
name = "common-decimal"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"bigdecimal 0.4.8",
"common-error",
@@ -2255,7 +2255,7 @@ dependencies = [
[[package]]
name = "common-error"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"common-macro",
"http 1.3.1",
@@ -2266,7 +2266,7 @@ dependencies = [
[[package]]
name = "common-event-recorder"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"api",
"async-trait",
@@ -2289,7 +2289,7 @@ dependencies = [
[[package]]
name = "common-frontend"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"api",
"async-trait",
@@ -2310,7 +2310,7 @@ dependencies = [
[[package]]
name = "common-function"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"ahash 0.8.12",
"api",
@@ -2348,6 +2348,7 @@ dependencies = [
"geohash",
"h3o",
"hyperloglogplus",
"icu_properties",
"jsonb",
"jsonpath-rust 0.7.5",
"memchr",
@@ -2373,7 +2374,7 @@ dependencies = [
[[package]]
name = "common-greptimedb-telemetry"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"async-trait",
"common-runtime",
@@ -2390,7 +2391,7 @@ dependencies = [
[[package]]
name = "common-grpc"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"api",
"arrow-flight",
@@ -2425,7 +2426,7 @@ dependencies = [
[[package]]
name = "common-grpc-expr"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"api",
"common-base",
@@ -2445,7 +2446,7 @@ dependencies = [
[[package]]
name = "common-macro"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"greptime-proto",
"once_cell",
@@ -2456,7 +2457,7 @@ dependencies = [
[[package]]
name = "common-mem-prof"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"anyhow",
"common-error",
@@ -2472,7 +2473,7 @@ dependencies = [
[[package]]
name = "common-memory-manager"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"common-error",
"common-macro",
@@ -2484,7 +2485,7 @@ dependencies = [
[[package]]
name = "common-meta"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"anymap2",
"api",
@@ -2555,7 +2556,7 @@ dependencies = [
[[package]]
name = "common-options"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"common-grpc",
"humantime-serde",
@@ -2565,11 +2566,11 @@ dependencies = [
[[package]]
name = "common-plugins"
version = "1.0.0"
version = "1.0.1"
[[package]]
name = "common-pprof"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"common-error",
"common-macro",
@@ -2580,7 +2581,7 @@ dependencies = [
[[package]]
name = "common-procedure"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"api",
"async-stream",
@@ -2609,7 +2610,7 @@ dependencies = [
[[package]]
name = "common-procedure-test"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"async-trait",
"common-procedure",
@@ -2619,7 +2620,7 @@ dependencies = [
[[package]]
name = "common-query"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"api",
"async-trait",
@@ -2645,7 +2646,7 @@ dependencies = [
[[package]]
name = "common-recordbatch"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"arc-swap",
"common-base",
@@ -2670,7 +2671,7 @@ dependencies = [
[[package]]
name = "common-runtime"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"async-trait",
"clap",
@@ -2699,7 +2700,7 @@ dependencies = [
[[package]]
name = "common-session"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"serde",
"strum 0.27.1",
@@ -2707,7 +2708,7 @@ dependencies = [
[[package]]
name = "common-sql"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"arrow-schema 57.3.0",
"common-base",
@@ -2727,7 +2728,7 @@ dependencies = [
[[package]]
name = "common-stat"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"common-base",
"common-runtime",
@@ -2742,7 +2743,7 @@ dependencies = [
[[package]]
name = "common-telemetry"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"backtrace",
"common-base",
@@ -2771,7 +2772,7 @@ dependencies = [
[[package]]
name = "common-test-util"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"client",
"common-grpc",
@@ -2784,7 +2785,7 @@ dependencies = [
[[package]]
name = "common-time"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"arrow 57.3.0",
"chrono",
@@ -2802,7 +2803,7 @@ dependencies = [
[[package]]
name = "common-version"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"cargo-manifest",
"const_format",
@@ -2812,7 +2813,7 @@ dependencies = [
[[package]]
name = "common-wal"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"common-base",
"common-error",
@@ -2835,7 +2836,7 @@ dependencies = [
[[package]]
name = "common-workload"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"common-telemetry",
"serde",
@@ -4197,7 +4198,7 @@ dependencies = [
[[package]]
name = "datanode"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"api",
"arrow-flight",
@@ -4265,7 +4266,7 @@ dependencies = [
[[package]]
name = "datatypes"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"arrow 57.3.0",
"arrow-array 57.3.0",
@@ -4943,7 +4944,7 @@ checksum = "37909eebbb50d72f9059c3b6d82c0463f2ff062c9e95845c43a6c9c0355411be"
[[package]]
name = "file-engine"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"api",
"async-trait",
@@ -5075,7 +5076,7 @@ checksum = "8bf7cc16383c4b8d58b9905a8509f02926ce3058053c056376248d958c9df1e8"
[[package]]
name = "flow"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"api",
"arrow 57.3.0",
@@ -5144,7 +5145,7 @@ dependencies = [
"sql",
"store-api",
"strum 0.27.1",
"substrait 1.0.0",
"substrait 1.0.1",
"table",
"tokio",
"tonic 0.14.2",
@@ -5205,7 +5206,7 @@ checksum = "28dd6caf6059519a65843af8fe2a3ae298b14b80179855aeb4adc2c1934ee619"
[[package]]
name = "frontend"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"api",
"arc-swap",
@@ -5681,7 +5682,7 @@ dependencies = [
[[package]]
name = "greptime-proto"
version = "0.1.0"
source = "git+https://github.com/GreptimeTeam/greptime-proto.git?rev=092ba1d01e2da676dca66cca7eebb55009da8ef8#092ba1d01e2da676dca66cca7eebb55009da8ef8"
source = "git+https://github.com/GreptimeTeam/greptime-proto.git?rev=26a50f4069f50c37d65b45e0d39ae0cb42de5425#26a50f4069f50c37d65b45e0d39ae0cb42de5425"
dependencies = [
"prost 0.14.1",
"prost-types 0.14.1",
@@ -5691,7 +5692,6 @@ dependencies = [
"strum_macros 0.25.3",
"tonic 0.14.2",
"tonic-prost",
"tonic-prost-build",
]
[[package]]
@@ -6453,7 +6453,7 @@ dependencies = [
[[package]]
name = "index"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"async-trait",
"asynchronous-codec",
@@ -7421,7 +7421,7 @@ checksum = "5e5032e24019045c762d3c0f28f5b6b8bbf38563a65908389bf7978758920897"
[[package]]
name = "log-query"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"chrono",
"common-error",
@@ -7433,7 +7433,7 @@ dependencies = [
[[package]]
name = "log-store"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"async-stream",
"async-trait",
@@ -7724,7 +7724,7 @@ dependencies = [
[[package]]
name = "meta-client"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"api",
"async-trait",
@@ -7755,7 +7755,7 @@ dependencies = [
[[package]]
name = "meta-srv"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"api",
"async-trait",
@@ -7855,7 +7855,7 @@ dependencies = [
[[package]]
name = "metric-engine"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"api",
"aquamarine",
@@ -7956,7 +7956,7 @@ dependencies = [
[[package]]
name = "mito-codec"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"api",
"bytes",
@@ -7981,7 +7981,7 @@ dependencies = [
[[package]]
name = "mito2"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"api",
"aquamarine",
@@ -8705,7 +8705,7 @@ dependencies = [
[[package]]
name = "object-store"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"anyhow",
"bytes",
@@ -8883,7 +8883,7 @@ dependencies = [
[[package]]
name = "opensrv-mysql"
version = "0.8.0"
source = "git+https://github.com/datafuselabs/opensrv?tag=v0.10.0#074bd8fb81da3c9e6d6a098a482f3380478b9c0b"
source = "git+https://github.com/GreptimeTeam/opensrv?rev=6c5a451544194b7bb60a8318d155d4f892b49f2c#6c5a451544194b7bb60a8318d155d4f892b49f2c"
dependencies = [
"async-trait",
"byteorder",
@@ -9032,7 +9032,7 @@ dependencies = [
[[package]]
name = "operator"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"ahash 0.8.12",
"api",
@@ -9092,7 +9092,7 @@ dependencies = [
"sql",
"sqlparser",
"store-api",
"substrait 1.0.0",
"substrait 1.0.1",
"table",
"tokio",
"tokio-util",
@@ -9368,7 +9368,7 @@ checksum = "e3c406c9e2aa74554e662d2c2ee11cd3e73756988800be7e6f5eddb16fed4699"
[[package]]
name = "partition"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"api",
"async-trait",
@@ -9724,7 +9724,7 @@ checksum = "8b870d8c151b6f2fb93e84a13146138f05d02ed11c7e7c54f8826aaaf7c9f184"
[[package]]
name = "pipeline"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"ahash 0.8.12",
"api",
@@ -9881,7 +9881,7 @@ dependencies = [
[[package]]
name = "plugins"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"auth",
"catalog",
@@ -10199,7 +10199,7 @@ dependencies = [
[[package]]
name = "promql"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"ahash 0.8.12",
"async-trait",
@@ -10551,7 +10551,7 @@ dependencies = [
[[package]]
name = "puffin"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"async-compression",
"async-trait",
@@ -10613,7 +10613,7 @@ dependencies = [
[[package]]
name = "query"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"ahash 0.8.12",
"api",
@@ -10680,7 +10680,7 @@ dependencies = [
"sql",
"sqlparser",
"store-api",
"substrait 1.0.0",
"substrait 1.0.1",
"table",
"tokio",
"tokio-stream",
@@ -11984,7 +11984,7 @@ dependencies = [
[[package]]
name = "servers"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"ahash 0.8.12",
"api",
@@ -12034,6 +12034,7 @@ dependencies = [
"datafusion-pg-catalog",
"datatypes",
"derive_builder 0.20.2",
"either",
"futures",
"futures-util",
"headers",
@@ -12080,6 +12081,7 @@ dependencies = [
"regex",
"reqwest",
"rust-embed",
"rust_decimal",
"rustls",
"rustls-pemfile",
"rustls-pki-types",
@@ -12118,7 +12120,7 @@ dependencies = [
[[package]]
name = "session"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"ahash 0.8.12",
"api",
@@ -12450,7 +12452,7 @@ dependencies = [
[[package]]
name = "sql"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"api",
"arrow-buffer 57.3.0",
@@ -12511,7 +12513,7 @@ dependencies = [
[[package]]
name = "sqlness-runner"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"async-trait",
"clap",
@@ -12791,7 +12793,7 @@ dependencies = [
[[package]]
name = "standalone"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"async-trait",
"catalog",
@@ -12835,7 +12837,7 @@ checksum = "a2eb9349b6444b326872e140eb1cf5e7c522154d69e7a0ffb0fb81c06b37543f"
[[package]]
name = "store-api"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"api",
"aquamarine",
@@ -13027,7 +13029,7 @@ dependencies = [
[[package]]
name = "substrait"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"async-trait",
"bytes",
@@ -13149,7 +13151,7 @@ dependencies = [
[[package]]
name = "table"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"api",
"arc-swap",
@@ -13419,7 +13421,7 @@ checksum = "8f50febec83f5ee1df3015341d8bd429f2d1cc62bcba7ea2076759d315084683"
[[package]]
name = "tests-fuzz"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"arbitrary",
"async-trait",
@@ -13463,7 +13465,7 @@ dependencies = [
[[package]]
name = "tests-integration"
version = "1.0.0"
version = "1.0.1"
dependencies = [
"api",
"arrow-flight",
@@ -13501,6 +13503,7 @@ dependencies = [
"datanode",
"datatypes",
"dotenv",
"either",
"flate2",
"flow",
"frontend",
@@ -13516,7 +13519,6 @@ dependencies = [
"meta-client",
"meta-srv",
"mito2",
"moka",
"mysql_async",
"object-store",
"opentelemetry-proto 0.31.0",
@@ -13540,7 +13542,7 @@ dependencies = [
"sqlx",
"standalone",
"store-api",
"substrait 1.0.0",
"substrait 1.0.1",
"table",
"tempfile",
"time",

@@ -75,7 +75,7 @@ members = [
resolver = "2"
[workspace.package]
version = "1.0.0"
version = "1.0.1"
edition = "2024"
license = "Apache-2.0"
@@ -154,13 +154,14 @@ etcd-client = { version = "0.17", features = [
fst = "0.4.7"
futures = "0.3"
futures-util = "0.3"
greptime-proto = { git = "https://github.com/GreptimeTeam/greptime-proto.git", rev = "092ba1d01e2da676dca66cca7eebb55009da8ef8" }
greptime-proto = { git = "https://github.com/GreptimeTeam/greptime-proto.git", rev = "26a50f4069f50c37d65b45e0d39ae0cb42de5425" }
hex = "0.4"
http = "1"
humantime = "2.1"
humantime-serde = "1.1"
hyper = "1.1"
hyper-util = "0.1"
icu_properties = "2.0.1"
itertools = "0.14"
jsonb = { version = "0.4.4", default-features = false }
lazy_static = "1.4"

@@ -12,7 +12,9 @@ footer = ""
body = """
# {{ version }}
{% if timestamp -%}
Release date: {{ timestamp | date(format="%B %d, %Y") }}
{% endif -%}
{%- set breakings = commits | filter(attribute="breaking", value=true) -%}
{%- if breakings | length > 0 %}
@@ -118,7 +120,10 @@ filter_commits = false
# regex for skipping tags
# skip_tags = ""
# regex for ignoring tags
ignore_tags = ".*-nightly-.*"
# Ignore nightly tags and build-suffixed release tags such as
# v1.0.0-rc.2-13cdfa9b5-20260325-1774407105 so their commits are merged into
# the next visible release section instead of creating extra headings.
ignore_tags = ".*-nightly-.*|^v[0-9]+\\.[0-9]+\\.[0-9]+(-(alpha|beta|rc)\\.[0-9]+)?-[0-9a-f]{7,}-[0-9]{8}-[0-9]+$"
# sort the tags topologically
topo_order = false
# sort the commits inside sections by oldest/newest order
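
To sanity-check which tags the new `ignore_tags` pattern is meant to skip, its two alternatives can be mirrored by hand. This is only a sketch and not part of the change: it avoids a regex engine, and `is_ignored_tag`, `is_prerelease` and `is_digits` are hypothetical names that approximate the pattern's intent with the standard library.

```rust
/// True when `s` is a non-empty run of ASCII digits (`[0-9]+`).
fn is_digits(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit())
}

/// True for a pre-release segment such as `rc.2` (`(alpha|beta|rc)\.[0-9]+`).
fn is_prerelease(s: &str) -> bool {
    match s.split_once('.') {
        Some((label, n)) => matches!(label, "alpha" | "beta" | "rc") && is_digits(n),
        None => false,
    }
}

/// Hypothetical hand-written mirror of the `ignore_tags` pattern.
fn is_ignored_tag(tag: &str) -> bool {
    // First alternative: `.*-nightly-.*` (unanchored).
    if tag.contains("-nightly-") {
        return true;
    }
    // Second alternative: `^v<major>.<minor>.<patch>[-<pre>]-<hash>-<date>-<build>$`.
    let Some(rest) = tag.strip_prefix('v') else {
        return false;
    };
    let parts: Vec<&str> = rest.split('-').collect();
    let core: Vec<&str> = parts[0].split('.').collect();
    if core.len() != 3 || !core.iter().all(|n| is_digits(n)) {
        return false;
    }
    // Skip the optional pre-release segment, then expect exactly three suffix parts.
    let has_pre = parts.get(1).is_some_and(|p| is_prerelease(p));
    let suffix = if has_pre { &parts[2..] } else { &parts[1..] };
    match suffix {
        [hash, date, build] => {
            hash.len() >= 7
                && hash.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
                && date.len() == 8
                && is_digits(date)
                && is_digits(build)
        }
        _ => false,
    }
}
```

Plain release tags and ordinary pre-release tags fall through to `false`, so they would still get their own changelog heading; only nightly and build-suffixed tags are skipped.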

@@ -9,6 +9,6 @@ catalog.workspace = true
common-error.workspace = true
common-macro.workspace = true
common-meta.workspace = true
moka.workspace = true
moka = { workspace = true, features = ["future"] }
partition.workspace = true
snafu.workspace = true

@@ -220,18 +220,8 @@ impl PrefixedAzblobConnection {
name: "AzBlob",
required: [
(&self.azblob_container, "container"),
(&self.azblob_root, "root"),
(&self.azblob_account_name, "account name"),
(&self.azblob_endpoint, "endpoint"),
],
custom_validator: |missing: &mut Vec<&str>| {
// account_key is only required if sas_token is not provided
if self.azblob_sas_token.is_none()
&& self.azblob_account_key.is_empty()
{
missing.push("account key (when sas_token is not provided)");
}
}
]
)
}
}
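
Judging from the updated tests in this PR, only the container and the endpoint remain required for Azure Blob, and a missing one surfaces as "AzBlob <field> must be set". The diff shows only the call site of the real validation, so the following is a rough sketch: `validate_required` is a hypothetical helper, and reporting just the first missing field is an assumption.

```rust
/// Hypothetical stand-in for the required-field check.
/// Returns an error naming the first required field whose value is empty.
fn validate_required(backend: &str, required: &[(&str, &str)]) -> Result<(), String> {
    for (value, label) in required {
        if value.is_empty() {
            // Message shape taken from the assertions in the tests of this PR.
            return Err(format!("{backend} {label} must be set"));
        }
    }
    Ok(())
}
```

With this shape, account name, account key and root simply never appear in the `required` list, which is why the tests that omit them now expect success.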

@@ -1084,7 +1084,7 @@ mod tests {
#[tokio::test]
async fn test_export_command_build_with_azblob_empty_account_name() {
// Test Azure Blob with empty account_name
// account_name is optional for Azure Blob validation
let cmd = ExportCommand::parse_from([
"export",
"--addr",
@@ -1092,30 +1092,19 @@ mod tests {
"--azblob",
"--azblob-container",
"test-container",
"--azblob-root",
"test-root",
"--azblob-account-name",
"", // Empty account name
"--azblob-account-key",
MOCK_AZBLOB_ACCOUNT_KEY_B64,
"--azblob-endpoint",
"https://account.blob.core.windows.net",
]);
let result = cmd.build().await;
assert!(result.is_err());
if let Err(err) = result {
assert!(
err.to_string().contains("AzBlob account name must be set"),
"Actual error: {}",
err
);
}
assert!(result.is_ok(), "Empty account_name should succeed");
}
#[tokio::test]
async fn test_export_command_build_with_azblob_missing_account_key() {
// Missing account key
// account_key is optional for Azure Blob validation
let cmd = ExportCommand::parse_from([
"export",
"--addr",
@@ -1123,24 +1112,12 @@ mod tests {
"--azblob",
"--azblob-container",
"test-container",
"--azblob-root",
"test-root",
"--azblob-account-name",
"test-account",
"--azblob-endpoint",
"https://account.blob.core.windows.net",
]);
let result = cmd.build().await;
assert!(result.is_err());
if let Err(err) = result {
assert!(
err.to_string()
.contains("AzBlob account key (when sas_token is not provided) must be set"),
"Actual error: {}",
err
);
}
assert!(result.is_ok(), "Missing account_key should succeed");
}
// ==================== Gap 3: Boundary cases ====================
@@ -1238,21 +1215,58 @@ mod tests {
"--azblob",
"--azblob-container",
"test-container",
"--azblob-root",
"test-root",
"--azblob-account-name",
"test-account",
"--azblob-account-key",
MOCK_AZBLOB_ACCOUNT_KEY_B64,
"--azblob-endpoint",
"https://account.blob.core.windows.net",
// No sas_token
]);
let result = cmd.build().await;
assert!(result.is_ok(), "Minimal AzBlob config should succeed");
}
#[tokio::test]
async fn test_export_command_build_with_azblob_missing_endpoint() {
let cmd = ExportCommand::parse_from([
"export",
"--addr",
"127.0.0.1:4000",
"--azblob",
"--azblob-container",
"test-container",
]);
let result = cmd.build().await;
assert!(result.is_err());
if let Err(err) = result {
assert!(
err.to_string().contains("AzBlob endpoint must be set"),
"Actual error: {}",
err
);
}
}
#[tokio::test]
async fn test_export_command_build_with_azblob_missing_container() {
let cmd = ExportCommand::parse_from([
"export",
"--addr",
"127.0.0.1:4000",
"--azblob",
"--azblob-endpoint",
"https://account.blob.core.windows.net",
]);
let result = cmd.build().await;
assert!(result.is_err());
if let Err(err) = result {
assert!(
err.to_string().contains("AzBlob container must be set"),
"Actual error: {}",
err
);
}
}
#[tokio::test]
async fn test_export_command_build_with_local_and_s3() {
// Both output-dir and S3 - S3 should take precedence
@@ -1287,7 +1301,7 @@ mod tests {
#[tokio::test]
async fn test_export_command_build_with_azblob_only_sas_token() {
// Azure Blob with sas_token but no account_key - should succeed
// Azure Blob with sas_token but no credentials - should still succeed
let cmd = ExportCommand::parse_from([
"export",
"--addr",
@@ -1295,15 +1309,10 @@ mod tests {
"--azblob",
"--azblob-container",
"test-container",
"--azblob-root",
"test-root",
"--azblob-account-name",
"test-account",
"--azblob-endpoint",
"https://account.blob.core.windows.net",
"--azblob-sas-token",
"test-sas-token",
// No account_key
]);
let result = cmd.build().await;
@@ -1324,10 +1333,6 @@ mod tests {
"--azblob",
"--azblob-container",
"test-container",
"--azblob-root",
"test-root",
"--azblob-account-name",
"test-account",
"--azblob-account-key",
"", // Empty account_key is OK if sas_token is provided
"--azblob-endpoint",

@@ -72,7 +72,7 @@ meta-client.workspace = true
meta-srv.workspace = true
metric-engine.workspace = true
mito2.workspace = true
moka.workspace = true
moka = { workspace = true, features = ["future"] }
object-store.workspace = true
parquet = { workspace = true, features = ["object_store"] }
plugins.workspace = true

@@ -102,31 +102,79 @@ impl Command {
#[cfg(test)]
mod tests {
use std::net::TcpListener;
use std::ops::RangeInclusive;
use clap::Parser;
use client::{Client, Database};
use common_catalog::consts::{DEFAULT_CATALOG_NAME, DEFAULT_SCHEMA_NAME};
use common_telemetry::logging::LoggingOptions;
use rand::Rng;
use crate::error::Result as CmdResult;
use crate::options::GlobalOptions;
use crate::{App, cli, standalone};
fn random_standalone_addrs() -> (String, String, String, String) {
let offset = choose_random_unused_port_offset(14000..=24000, 10);
(
format!("127.0.0.1:{}", 4000 + offset),
format!("127.0.0.1:{}", 4001 + offset),
format!("127.0.0.1:{}", 4002 + offset),
format!("127.0.0.1:{}", 4003 + offset),
)
}
fn choose_random_unused_port_offset(
port_range: RangeInclusive<u16>,
max_attempts: usize,
) -> u16 {
let mut rng = rand::rng();
for _ in 0..max_attempts {
let http_port = rng.random_range(port_range.clone());
let offset = http_port - 4000;
let ports = [4000 + offset, 4001 + offset, 4002 + offset, 4003 + offset];
let listeners = ports
.into_iter()
.map(|port| TcpListener::bind(("127.0.0.1", port)))
.collect::<Result<Vec<_>, _>>();
if listeners.is_ok() {
return offset;
}
}
panic!("failed to find unused standalone test ports");
}
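
Review note: the probing idea in the helper above can be isolated from the random number generator. The sketch below uses only the standard library and takes candidate base ports from the caller; `port_block_is_free` and `first_free_block` are hypothetical names, not part of this change.

```rust
use std::net::TcpListener;

/// True when every port in `base..base + count` can be bound on loopback.
/// The listeners are dropped on return, so the ports are only probably free afterwards.
fn port_block_is_free(base: u16, count: u16) -> bool {
    let Some(end) = base.checked_add(count) else {
        return false; // conservatively reject blocks that reach the top of the port range
    };
    // Collecting into `Result` stops at the first bind failure while keeping
    // the earlier listeners alive until the whole block has been checked.
    (base..end)
        .map(|port| TcpListener::bind(("127.0.0.1", port)))
        .collect::<Result<Vec<_>, _>>()
        .is_ok()
}

/// Returns the first candidate base whose whole block is free.
fn first_free_block(candidates: impl IntoIterator<Item = u16>, count: u16) -> Option<u16> {
    candidates
        .into_iter()
        .find(|&base| port_block_is_free(base, count))
}
```

As with any probe-then-bind scheme, a port that was free during the probe can be taken by another process before the server binds it, so this lowers the collision rate between parallel tests rather than ruling collisions out.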
#[tokio::test(flavor = "multi_thread")]
async fn test_export_create_table_with_quoted_names() -> CmdResult<()> {
let output_dir = tempfile::tempdir().unwrap();
let (http_addr, rpc_addr, mysql_addr, postgres_addr) = random_standalone_addrs();
let standalone = standalone::Command::parse_from([
"standalone",
"start",
"--data-home",
&*output_dir.path().to_string_lossy(),
"--http-addr",
&http_addr,
"--rpc-bind-addr",
&rpc_addr,
"--mysql-addr",
&mysql_addr,
"--postgres-addr",
&postgres_addr,
]);
let standalone_opts = standalone.load_options(&GlobalOptions::default()).unwrap();
let mut instance = standalone.build(standalone_opts).await?;
instance.start().await?;
let client = Client::with_urls(["127.0.0.1:4001"]);
let client = Client::with_urls([rpc_addr.as_str()]);
let database = Database::new(DEFAULT_CATALOG_NAME, DEFAULT_SCHEMA_NAME, client);
database
.sql(r#"CREATE DATABASE "cli.export.create_table";"#)
@@ -149,7 +197,7 @@ mod tests {
"data",
"export",
"--addr",
"127.0.0.1:4000",
&http_addr,
"--output-dir",
&*output_dir.path().to_string_lossy(),
"--target",

@@ -42,6 +42,7 @@ use common_meta::region_keeper::MemoryRegionKeeper;
use common_meta::region_registry::LeaderRegionRegistry;
use common_meta::sequence::{Sequence, SequenceBuilder};
use common_meta::wal_provider::{WalProviderRef, build_wal_provider};
use common_options::plugin_options::StandaloneFlag;
use common_procedure::ProcedureManagerRef;
use common_query::prelude::set_default_prefix;
use common_telemetry::info;
@@ -369,6 +370,7 @@ impl StartCommand {
creator: InstanceCreator,
) -> Result<(Instance, InstanceCreatorResult)> {
let mut plugins = Plugins::new();
plugins.insert(StandaloneFlag);
set_default_prefix(opts.default_column_prefix.as_deref())
.map_err(BoxedError::new)
.context(error::BuildCliSnafu)?;

@@ -47,6 +47,7 @@ geo-types = { version = "0.7", optional = true }
geohash = { version = "0.13", optional = true }
h3o = { version = "0.6", optional = true }
hyperloglogplus = "0.4"
icu_properties.workspace = true
jsonb.workspace = true
jsonpath-rust = "0.7.5"
memchr = "2.7"

@@ -128,7 +128,7 @@ mod tests {
};
let result = f.invoke_async_with_args(func_args).await.unwrap_err();
assert_eq!(
"Execution error: Handler error: Missing TableMutationHandler, not expected",
"Execution error: Missing TableMutationHandler, not expected",
result.to_string()
);
}

@@ -355,7 +355,7 @@ mod tests {
};
let result = f.invoke_async_with_args(func_args).await.unwrap_err();
assert_eq!(
"Execution error: Handler error: Missing TableMutationHandler, not expected",
"Execution error: Missing TableMutationHandler, not expected",
result.to_string()
);
}

@@ -173,7 +173,7 @@ mod tests {
};
let result = f.invoke_async_with_args(func_args).await.unwrap_err();
assert_eq!(
"Execution error: Handler error: Missing ProcedureServiceHandler, not expected",
"Execution error: Missing ProcedureServiceHandler, not expected",
result.to_string()
);
}

@@ -149,7 +149,7 @@ mod test {
let result = f.invoke_async_with_args(func_args).await.unwrap_err();
assert_eq!(
"Execution error: Handler error: Missing FlowServiceHandler, not expected",
"Execution error: Missing FlowServiceHandler, not expected",
result.to_string()
);
}

@@ -20,7 +20,10 @@ use common_query::error::InvalidFuncArgsSnafu;
use datafusion_common::arrow::array::{Array, AsArray, StringViewBuilder, UInt32Builder};
use datafusion_common::arrow::compute;
use datafusion_common::arrow::datatypes::{DataType, UInt32Type};
use datafusion_expr::{ColumnarValue, ScalarFunctionArgs, Signature, TypeSignature, Volatility};
use datafusion_expr::{
Coercion, ColumnarValue, ScalarFunctionArgs, Signature, TypeSignature, TypeSignatureClass,
Volatility,
};
use derive_more::Display;
use crate::function::{Function, extract_args};
@@ -44,7 +47,7 @@ impl Default for Ipv4NumToString {
fn default() -> Self {
Self {
signature: Signature::new(
TypeSignature::Exact(vec![DataType::UInt32]),
TypeSignature::Coercible(vec![Coercion::new_exact(TypeSignatureClass::Integer)]),
Volatility::Immutable,
),
aliases: ["inet_ntoa".to_string()],
@@ -70,6 +73,14 @@ impl Function for Ipv4NumToString {
args: ScalarFunctionArgs,
) -> datafusion_common::Result<ColumnarValue> {
let [arg0] = extract_args(self.name(), &args)?;
let arg0 = compute::cast_with_options(
&arg0,
&DataType::UInt32,
&compute::CastOptions {
safe: false,
..Default::default()
},
)?;
let uint_vec = arg0.as_primitive::<UInt32Type>();
let size = uint_vec.len();
@@ -171,7 +182,7 @@ mod tests {
use std::sync::Arc;
use arrow_schema::Field;
use datafusion_common::arrow::array::{StringViewArray, UInt32Array};
use datafusion_common::arrow::array::{Int64Array, StringViewArray, UInt32Array};
use super::*;
@@ -200,6 +211,51 @@ mod tests {
assert_eq!(result.value(3), "255.255.255.255");
}
#[test]
fn test_ipv4_num_to_string_accepts_int64() {
let func = Ipv4NumToString::default();
// Test data
let values = vec![167772161i64, 3232235521i64, 0i64, 4294967295i64];
let input = ColumnarValue::Array(Arc::new(Int64Array::from(values)));
let args = ScalarFunctionArgs {
args: vec![input],
arg_fields: vec![],
number_rows: 4,
return_field: Arc::new(Field::new("x", DataType::Utf8View, false)),
config_options: Arc::new(Default::default()),
};
let result = func.invoke_with_args(args).unwrap();
let result = result.to_array(4).unwrap();
let result = result.as_string_view();
assert_eq!(result.value(0), "10.0.0.1");
assert_eq!(result.value(1), "192.168.0.1");
assert_eq!(result.value(2), "0.0.0.0");
assert_eq!(result.value(3), "255.255.255.255");
}
#[test]
fn test_ipv4_num_to_string_rejects_negative_int64() {
let func = Ipv4NumToString::default();
// Test data
let values = vec![-1i64];
let input = ColumnarValue::Array(Arc::new(Int64Array::from(values)));
let args = ScalarFunctionArgs {
args: vec![input],
arg_fields: vec![],
number_rows: 1,
return_field: Arc::new(Field::new("x", DataType::Utf8View, false)),
config_options: Arc::new(Default::default()),
};
let result = func.invoke_with_args(args);
assert!(result.is_err());
}
#[test]
fn test_ipv4_string_to_num() {
let func = Ipv4StringToNum::default();
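
The signature change in this file lets `ipv4_num_to_string` accept any integer type and then casts the argument to `UInt32` with `safe: false`, so an out-of-range value produces an error rather than a null. The same checked-narrowing idea in plain Rust, as a sketch: `ipv4_from_i64` is a hypothetical helper built on the standard library, not the function's actual formatting code.

```rust
use std::net::Ipv4Addr;

/// Converts a signed 64-bit value to dotted-quad notation, rejecting
/// anything outside `0..=u32::MAX` instead of wrapping.
fn ipv4_from_i64(n: i64) -> Option<String> {
    // `try_from` fails for negative values and for values above u32::MAX.
    let bits = u32::try_from(n).ok()?;
    Some(Ipv4Addr::from(bits).to_string())
}
```

The inputs used in the new tests behave the same way here: `167772161` maps to `10.0.0.1`, while `-1` is rejected.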

@@ -20,6 +20,8 @@ use datafusion_common::arrow::compute;
use datafusion_common::arrow::datatypes::DataType;
use datafusion_common::{DataFusionError, ScalarValue};
use datafusion_expr::{ColumnarValue, ScalarFunctionArgs, Signature, Volatility};
use icu_properties::props::Script;
use icu_properties::{CodePointMapData, CodePointMapDataBorrowed};
use memchr::memmem;
use crate::function::Function;
@@ -27,10 +29,11 @@ use crate::function_registry::FunctionRegistry;
/// Exact term/phrase matching function for text columns.
///
/// This function checks if a text column contains exact term/phrase matches
/// with non-alphanumeric boundaries. Designed for:
/// - Whole-word matching (e.g. "cat" in "cat!" but not in "category")
/// This function uses script-aware matching rules:
/// - ASCII-only terms keep whole-word style boundary matching (e.g. "cat" in "cat!" but not in "category")
/// - Phrase matching (e.g. "hello world" in "note:hello world!")
/// - Terms containing Han characters match as contiguous substrings
/// - Mixed-script identifiers and numeric terms remain searchable in Chinese text
///
/// # Signature
/// `matches_term(text: String, term: String) -> Boolean`
@@ -43,9 +46,8 @@ use crate::function_registry::FunctionRegistry;
/// BooleanVector where each element indicates if the corresponding text
/// contains an exact match of the term, following these rules:
/// 1. Exact substring match found (case-sensitive)
/// 2. Match boundaries are either:
/// - Start/end of text
/// - Any non-alphanumeric character (including spaces, hyphens, punctuation, etc.)
/// 2. For ASCII-only terms, adjacent ASCII word characters block the match
/// 3. For Han-containing terms, contiguous substring match is sufficient
///
/// # Examples
/// ```
@@ -60,6 +62,9 @@ use crate::function_registry::FunctionRegistry;
/// SELECT matches_term(column, 'critical error') FROM logs;
/// -- Match in: "ERROR:critical error!"
/// -- No match: "critical_errors"
/// -- Chinese substring examples --
/// SELECT matches_term(column, '手机') FROM table;
/// -- Text: "登录手机号18888888888的动态key" => true
///
/// -- Empty string handling --
/// SELECT matches_term(column, '') FROM table;
@@ -204,9 +209,8 @@ impl Function for MatchesTermFunction {
///
/// A term is considered matched when:
/// 1. The exact sequence appears in the text
/// 2. It is either:
/// - At the start/end of text with adjacent non-alphanumeric character
/// - Surrounded by non-alphanumeric characters
/// 2. ASCII-only terms are not adjacent to ASCII word characters
/// 3. Han-containing terms match as contiguous substrings
///
/// # Examples
/// ```
@@ -215,28 +219,105 @@ impl Function for MatchesTermFunction {
/// assert!(finder.find("dog,cat")); // Term preceded by comma
/// assert!(!finder.find("category")); // Partial match rejected
///
/// let finder = MatchesTermFinder::new("world");
/// assert!(finder.find("hello-world")); // Hyphen boundary
/// let finder = MatchesTermFinder::new("手机");
/// assert!(finder.find("登录手机号18888888888的动态key"));
/// ```
#[derive(Clone, Debug)]
pub struct MatchesTermFinder {
finder: memmem::Finder<'static>,
term: String,
starts_with_non_alnum: bool,
ends_with_non_alnum: bool,
term_kind: TermKind,
starts_with_other: bool,
ends_with_other: bool,
}
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum CharClass {
AsciiWord,
Han,
UnicodeWord,
Other,
}
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum TermKind {
AsciiLike,
UnicodeWord,
HanContaining,
}
fn classify_char(c: char) -> CharClass {
if c.is_ascii_alphanumeric() {
CharClass::AsciiWord
} else if is_han(c) {
CharClass::Han
} else if c.is_alphanumeric() {
CharClass::UnicodeWord
} else {
CharClass::Other
}
}
static HAN_SCRIPT_DATA: CodePointMapDataBorrowed<'static, Script> =
CodePointMapData::<Script>::new();
fn is_han(c: char) -> bool {
HAN_SCRIPT_DATA.get(c) == Script::Han
}
fn classify_term(term: &str) -> TermKind {
let mut has_han = false;
let mut has_unicode_word = false;
for c in term.chars() {
match classify_char(c) {
CharClass::AsciiWord => {}
CharClass::Han => has_han = true,
CharClass::UnicodeWord => has_unicode_word = true,
CharClass::Other => {}
}
}
if has_han {
TermKind::HanContaining
} else if has_unicode_word {
TermKind::UnicodeWord
} else {
TermKind::AsciiLike
}
}
fn boundary_ok(term_kind: TermKind, neighbor: Option<char>, term_has_other_boundary: bool) -> bool {
if term_has_other_boundary {
return true;
}
match term_kind {
TermKind::AsciiLike => !matches!(neighbor.map(classify_char), Some(CharClass::AsciiWord)),
TermKind::UnicodeWord => !matches!(
neighbor.map(classify_char),
Some(CharClass::AsciiWord | CharClass::UnicodeWord | CharClass::Han)
),
TermKind::HanContaining => true,
}
}
impl MatchesTermFinder {
/// Create a new `MatchesTermFinder` for the given term.
pub fn new(term: &str) -> Self {
let starts_with_non_alnum = term.chars().next().is_some_and(|c| !c.is_alphanumeric());
let ends_with_non_alnum = term.chars().last().is_some_and(|c| !c.is_alphanumeric());
let starts_with_other = term
.chars()
.next()
.is_some_and(|c| classify_char(c) == CharClass::Other);
let ends_with_other = term
.chars()
.last()
.is_some_and(|c| classify_char(c) == CharClass::Other);
Self {
finder: memmem::Finder::new(term).into_owned(),
term: term.to_string(),
starts_with_non_alnum,
ends_with_non_alnum,
term_kind: classify_term(term),
starts_with_other,
ends_with_other,
}
}
@@ -254,21 +335,17 @@ impl MatchesTermFinder {
while let Some(found_pos) = self.finder.find(&text.as_bytes()[pos..]) {
let actual_pos = pos + found_pos;
let prev_ok = self.starts_with_non_alnum
|| text[..actual_pos]
.chars()
.last()
.map(|c| !c.is_alphanumeric())
.unwrap_or(true);
let prev = text[..actual_pos].chars().last();
let prev_ok = self.starts_with_other || boundary_ok(self.term_kind, prev, false);
if prev_ok {
if self.term_kind == TermKind::HanContaining {
return true;
}
let next_pos = actual_pos + self.finder.needle().len();
let next_ok = self.ends_with_non_alnum
|| text[next_pos..]
.chars()
.next()
.map(|c| !c.is_alphanumeric())
.unwrap_or(true);
let next = text[next_pos..].chars().next();
let next_ok = self.ends_with_other || boundary_ok(self.term_kind, next, false);
if next_ok {
return true;
@@ -369,6 +446,25 @@ mod tests {
assert!(!MatchesTermFinder::new("v1.0").find("v1.0a"));
}
#[test]
fn mixed_script_terms_match_inside_chinese_context() {
let text = "登录手机号18888888888的动态key";
assert!(MatchesTermFinder::new("手机号").find(text));
assert!(MatchesTermFinder::new("18888888888").find(text));
assert!(MatchesTermFinder::new("手机").find(text));
assert!(MatchesTermFinder::new("机号").find(text));
assert!(MatchesTermFinder::new("机号1888").find(text));
assert!(MatchesTermFinder::new("农业").find("中国农业银行"));
assert!(MatchesTermFinder::new("error").find("错误error日志"));
}
#[test]
fn underscore_still_counts_as_boundary_for_ascii_terms() {
assert!(MatchesTermFinder::new("world").find("hello_world"));
assert!(MatchesTermFinder::new("id").find("trace_id=abc"));
assert!(!MatchesTermFinder::new("error").find("criticalerrors"));
}
#[test]
fn adjacent_alphanumeric_fails() {
assert!(!MatchesTermFinder::new("cat").find("cat5"));
@@ -406,4 +502,18 @@ mod tests {
assert!(MatchesTermFinder::new("中文").find("这是中文测试,中文!"));
assert!(MatchesTermFinder::new("error").find("错误errorerror日志_error!"));
}
#[test]
fn han_terms_match_as_contiguous_substrings() {
assert!(MatchesTermFinder::new("行账号").find("中国农业银行账号"));
assert!(MatchesTermFinder::new("登录").find("登录手机号18888888888的动态key"));
}
#[test]
fn han_detection_uses_script_not_all_cjk() {
assert!(is_han('汉'));
assert!(is_han('\u{30000}'));
assert!(!is_han('あ'));
assert!(!is_han('한'));
}
}
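
The script-aware boundary rules in this file can be condensed into a dependency-free sketch. Two things here are assumptions rather than the PR's code: `is_han_approx` replaces the `icu_properties` script lookup with a few hard-coded CJK ideograph ranges, and empty terms are simply rejected because their handling is not visible in this diff. `matches_term` is a hypothetical free function standing in for `MatchesTermFinder::find`.

```rust
#[derive(Clone, Copy, PartialEq)]
enum CharClass { AsciiWord, Han, UnicodeWord, Other }

#[derive(Clone, Copy, PartialEq)]
enum TermKind { AsciiLike, UnicodeWord, HanContaining }

/// ASSUMPTION: rough range check, not the ICU `Script::Han` property.
fn is_han_approx(c: char) -> bool {
    matches!(c as u32, 0x3400..=0x4DBF | 0x4E00..=0x9FFF | 0x20000..=0x323AF)
}

fn classify_char(c: char) -> CharClass {
    if c.is_ascii_alphanumeric() {
        CharClass::AsciiWord
    } else if is_han_approx(c) {
        CharClass::Han
    } else if c.is_alphanumeric() {
        CharClass::UnicodeWord
    } else {
        CharClass::Other
    }
}

fn classify_term(term: &str) -> TermKind {
    let classes: Vec<CharClass> = term.chars().map(classify_char).collect();
    if classes.contains(&CharClass::Han) {
        TermKind::HanContaining
    } else if classes.contains(&CharClass::UnicodeWord) {
        TermKind::UnicodeWord
    } else {
        TermKind::AsciiLike
    }
}

/// Whether `neighbor` (the character next to a hit) allows the match.
fn boundary_ok(kind: TermKind, neighbor: Option<char>) -> bool {
    match (kind, neighbor.map(classify_char)) {
        (TermKind::HanContaining, _) | (_, None) | (_, Some(CharClass::Other)) => true,
        // ASCII terms are blocked only by adjacent ASCII word characters.
        (TermKind::AsciiLike, Some(class)) => class != CharClass::AsciiWord,
        // Other Unicode terms are blocked by any adjacent word character.
        (TermKind::UnicodeWord, Some(_)) => false,
    }
}

fn matches_term(text: &str, term: &str) -> bool {
    if term.is_empty() {
        return false; // ASSUMPTION: empty terms are out of scope for this sketch
    }
    let kind = classify_term(term);
    let is_other = |c: Option<char>| c.is_some_and(|c| classify_char(c) == CharClass::Other);
    let starts_with_other = is_other(term.chars().next());
    let ends_with_other = is_other(term.chars().last());

    let mut from = 0;
    while let Some(offset) = text[from..].find(term) {
        let start = from + offset;
        let end = start + term.len();
        let prev = text[..start].chars().last();
        let next = text[end..].chars().next();
        if (starts_with_other || boundary_ok(kind, prev))
            && (ends_with_other || boundary_ok(kind, next))
        {
            return true;
        }
        // Step past the first character of this hit and keep scanning.
        from = start + text[start..].chars().next().map_or(1, char::len_utf8);
    }
    false
}
```

A Han-containing term matches at its first occurrence because both boundary checks pass unconditionally, which corresponds to the early `return true` added to `find` in this PR.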

@@ -316,14 +316,14 @@ fn build_struct(
.#handler
.as_ref()
.context(#snafu_type)
.map_err(|e| datafusion_common::DataFusionError::Execution(format!("Handler error: {}", e.output_msg())))?;
.map_err(|e| datafusion_common::DataFusionError::Execution(e.output_msg()))?;
let mut builder = store_api::storage::ConcreteDataType::#ret()
.create_mutable_vector(rows_num);
if columns_num == 0 {
let result = #fn_name(handler, query_ctx, &[]).await
.map_err(|e| datafusion_common::DataFusionError::Execution(format!("Function execution error: {}", e.output_msg())))?;
.map_err(|e| datafusion_common::DataFusionError::Execution(e.output_msg()))?;
builder.push_value_ref(&result.as_value_ref());
} else {
@@ -333,7 +333,7 @@ fn build_struct(
.collect();
let result = #fn_name(handler, query_ctx, &args).await
.map_err(|e| datafusion_common::DataFusionError::Execution(format!("Function execution error: {}", e.output_msg())))?;
.map_err(|e| datafusion_common::DataFusionError::Execution(e.output_msg()))?;
builder.push_value_ref(&result.as_value_ref());
}

@@ -59,7 +59,7 @@ hex.workspace = true
humantime-serde.workspace = true
itertools.workspace = true
lazy_static.workspace = true
moka.workspace = true
moka = { workspace = true, features = ["future"] }
object-store.workspace = true
prometheus.workspace = true
prost.workspace = true

@@ -573,4 +573,27 @@ mod tests {
let region_num = stat_val.region_num().unwrap();
assert_eq!(2, region_num);
}
#[test]
fn test_region_stat_from_heartbeat_preserves_staging_leader_role() {
let request = HeartbeatRequest {
header: Some(RequestHeader::default()),
peer: Some(api::v1::meta::Peer {
id: 1,
addr: "127.0.0.1:3001".to_string(),
}),
region_stats: vec![api::v1::meta::RegionStat {
region_id: RegionId::new(1024, 1).as_u64(),
engine: "mito".to_string(),
role: api::v1::meta::RegionRole::StagingLeader.into(),
..Default::default()
}],
..Default::default()
};
let stat = Stat::try_from(&request).unwrap();
assert_eq!(stat.region_stats.len(), 1);
assert_eq!(stat.region_stats[0].role, RegionRole::StagingLeader);
}
}

@@ -45,7 +45,7 @@ use crate::lock_key::{CatalogLock, SchemaLock, TableNameLock};
use crate::metrics;
use crate::region_keeper::OperatingRegionGuard;
use crate::rpc::ddl::CreateTableTask;
use crate::rpc::router::{RegionRoute, operating_leader_regions};
use crate::rpc::router::{RegionRoute, operating_leader_region_roles};
pub struct CreateTableProcedure {
pub context: DdlContext,
@@ -172,8 +172,24 @@ impl CreateTableProcedure {
/// - [Code::Cancelled](tonic::status::Code::Cancelled)
/// - [Code::DeadlineExceeded](tonic::status::Code::DeadlineExceeded)
/// - [Code::Unavailable](tonic::status::Code::Unavailable)
pub async fn on_datanode_create_regions(&mut self) -> Result<Status> {
let table_route = self.table_route()?.clone();
pub async fn on_datanode_create_regions(&mut self, retrying: bool) -> Result<Status> {
let mut table_route = self.table_route()?.clone();
if retrying {
info!(
"Remapping region routes addresses for retrying create regions for table: {}",
self.data.table_ref()
);
let storage = self
.context
.table_metadata_manager
.table_route_manager()
.table_route_storage();
// The peer addresses may change during retries,
// so we always remap the region routes.
storage
.remap_region_routes(&mut table_route.region_routes)
.await?;
}
// Registers opening regions
let guards = self.register_opening_regions(&self.context, &table_route.region_routes)?;
if !guards.is_empty() {
@@ -240,17 +256,17 @@ impl CreateTableProcedure {
context: &DdlContext,
region_routes: &[RegionRoute],
) -> Result<Vec<OperatingRegionGuard>> {
let opening_regions = operating_leader_regions(region_routes);
let opening_regions = operating_leader_region_roles(region_routes);
if self.opening_regions.len() == opening_regions.len() {
return Ok(vec![]);
}
let mut opening_region_guards = Vec::with_capacity(opening_regions.len());
for (region_id, datanode_id) in opening_regions {
for (region_id, datanode_id, role) in opening_regions {
let guard = context
.memory_region_keeper
.register(datanode_id, region_id)
.register_with_role(datanode_id, region_id, role)
.context(error::RegionOperatingRaceSnafu {
region_id,
peer_id: datanode_id,
@@ -301,7 +317,10 @@ impl Procedure for CreateTableProcedure {
match state {
CreateTableState::Prepare => self.on_prepare().await,
CreateTableState::DatanodeCreateRegions => self.on_datanode_create_regions().await,
CreateTableState::DatanodeCreateRegions => {
let retrying = ctx.is_retrying().await.unwrap_or(false);
self.on_datanode_create_regions(retrying).await
}
CreateTableState::CreateMetadata => self.on_create_metadata(ctx.procedure_id).await,
}
.map_err(map_to_procedure_error)
@@ -339,7 +358,7 @@ pub struct CreateTableData {
#[serde(default)]
pub column_metadatas: Vec<ColumnMetadata>,
/// None stands for not allocated yet.
table_route: Option<PhysicalTableRouteValue>,
pub(crate) table_route: Option<PhysicalTableRouteValue>,
/// None stands for not allocated yet.
pub region_wal_options: Option<HashMap<RegionNumber, String>>,
}
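
The retry path in this file refreshes peer addresses before regions are registered again, since a datanode may have come back under a different address. What such a remap amounts to can be pictured roughly as below. Everything in this sketch is a simplified stand-in: `Peer`, `RegionRoute` and `remap_leader_addrs` are hypothetical, the lookup is an in-memory map, and the real `remap_region_routes` is an async call on the table route storage whose internals this diff does not show.

```rust
use std::collections::HashMap;

/// Simplified stand-ins for the real route types (hypothetical).
#[derive(Debug, Clone, PartialEq)]
struct Peer {
    id: u64,
    addr: String,
}

#[derive(Debug, Clone, PartialEq)]
struct RegionRoute {
    region_id: u64,
    leader_peer: Option<Peer>,
}

/// Overwrites each leader's address with the latest known one for its id.
/// Peers missing from `latest` keep their stored address.
fn remap_leader_addrs(routes: &mut [RegionRoute], latest: &HashMap<u64, String>) {
    for peer in routes.iter_mut().filter_map(|r| r.leader_peer.as_mut()) {
        if let Some(addr) = latest.get(&peer.id) {
            peer.addr = addr.clone();
        }
    }
}
```

Keying the lookup by peer id rather than by address is what makes the stored route reusable after an address change.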

@@ -58,6 +58,7 @@ pub(crate) struct DropDatabaseContext {
schema: String,
drop_if_exists: bool,
tables: Option<BoxStream<'static, Result<(String, TableNameValue)>>>,
retrying: bool,
}
#[async_trait::async_trait]
@@ -90,6 +91,7 @@ impl DropDatabaseProcedure {
schema,
drop_if_exists,
tables: None,
retrying: false,
},
state: Box::new(DropDatabaseStart),
}
@@ -110,6 +112,7 @@ impl DropDatabaseProcedure {
schema,
drop_if_exists,
tables: None,
retrying: false,
},
state,
})
@@ -136,9 +139,10 @@ impl Procedure for DropDatabaseProcedure {
})
}
async fn execute(&mut self, _ctx: &ProcedureContext) -> ProcedureResult<Status> {
async fn execute(&mut self, ctx: &ProcedureContext) -> ProcedureResult<Status> {
let state = &mut self.state;
self.context.retrying = ctx.is_retrying().await.unwrap_or(false);
let (next, status) = state
.next(&self.runtime_context, &mut self.context)
.await

@@ -224,6 +224,7 @@ mod tests {
schema: DEFAULT_SCHEMA_NAME.to_string(),
drop_if_exists: false,
tables: None,
retrying: false,
};
// Ticks
let (mut state, status) = state.next(&ddl_context, &mut ctx).await.unwrap();
@@ -259,6 +260,7 @@ mod tests {
schema: DEFAULT_SCHEMA_NAME.to_string(),
drop_if_exists: false,
tables: None,
retrying: false,
};
// Ticks
let (state, status) = state.next(&ddl_context, &mut ctx).await.unwrap();
@@ -287,6 +289,7 @@ mod tests {
schema: DEFAULT_SCHEMA_NAME.to_string(),
drop_if_exists: false,
tables: None,
retrying: false,
};
// Ticks
let (state, status) = state.next(&ddl_context, &mut ctx).await.unwrap();

@@ -29,7 +29,7 @@ use crate::ddl::utils::get_region_wal_options;
use crate::error::{self, Result};
use crate::key::table_route::TableRouteValue;
use crate::region_keeper::OperatingRegionGuard;
use crate::rpc::router::{RegionRoute, operating_leader_regions};
use crate::rpc::router::{RegionRoute, operating_leader_region_roles};
#[derive(Debug, Serialize, Deserialize)]
pub(crate) struct DropDatabaseExecutor {
@@ -69,12 +69,12 @@ impl DropDatabaseExecutor {
if !self.dropping_regions.is_empty() {
return Ok(());
}
let dropping_regions = operating_leader_regions(&self.physical_region_routes);
let dropping_regions = operating_leader_region_roles(&self.physical_region_routes);
let mut dropping_region_guards = Vec::with_capacity(dropping_regions.len());
for (region_id, datanode_id) in dropping_regions {
for (region_id, datanode_id, role) in dropping_regions {
let guard = ddl_ctx
.memory_region_keeper
.register(datanode_id, region_id)
.register_with_role(datanode_id, region_id, role)
.context(error::RegionOperatingRaceSnafu {
region_id,
peer_id: datanode_id,
@@ -96,10 +96,25 @@ impl State for DropDatabaseExecutor {
async fn next(
&mut self,
ddl_ctx: &DdlContext,
_ctx: &mut DropDatabaseContext,
ctx: &mut DropDatabaseContext,
) -> Result<(Box<dyn State>, Status)> {
self.register_dropping_regions(ddl_ctx)?;
let executor = DropTableExecutor::new(self.table_name.clone(), self.table_id, true);
if ctx.retrying {
info!(
"Remapping region routes addresses for retrying drop regions for table_id: {}",
self.table_id
);
let storage = ddl_ctx
.table_metadata_manager
.table_route_manager()
.table_route_storage();
// The peer addresses may change during retries,
// so we always remap the region routes.
storage
.remap_region_routes(&mut self.physical_region_routes)
.await?;
}
// Deletes metadata for table permanently.
let table_route_value = TableRouteValue::new(
self.table_id,
@@ -144,6 +159,7 @@ impl State for DropDatabaseExecutor {
#[cfg(test)]
mod tests {
use std::collections::HashSet;
use std::sync::Arc;
use api::region::RegionResponse;
@@ -152,16 +168,21 @@ mod tests {
use common_error::ext::BoxedError;
use common_query::request::QueryRequest;
use common_recordbatch::SendableRecordBatchStream;
use store_api::region_engine::RegionRole;
use store_api::storage::RegionId;
use table::table_name::TableName;
use crate::ddl::drop_database::cursor::DropDatabaseCursor;
use crate::ddl::drop_database::executor::DropDatabaseExecutor;
use crate::ddl::drop_database::{DropDatabaseContext, DropTableTarget, State};
use crate::ddl::test_util::{create_logical_table, create_physical_table};
use crate::ddl::test_util::datanode_handler::DatanodeWatcher;
use crate::ddl::test_util::{
create_logical_table, create_physical_table, put_datanode_address,
};
use crate::error::{self, Error, Result};
use crate::key::datanode_table::DatanodeTableKey;
use crate::peer::Peer;
use crate::rpc::router::region_distribution;
use crate::rpc::router::{LeaderState, Region, RegionRoute, region_distribution};
use crate::test_util::{MockDatanodeHandler, MockDatanodeManager, new_ddl_context};
#[derive(Clone)]
@@ -206,6 +227,7 @@ mod tests {
schema: DEFAULT_SCHEMA_NAME.to_string(),
drop_if_exists: false,
tables: None,
retrying: false,
};
let (state, status) = state.next(&ddl_context, &mut ctx).await.unwrap();
assert!(!status.need_persist());
@@ -218,6 +240,7 @@ mod tests {
schema: DEFAULT_SCHEMA_NAME.to_string(),
drop_if_exists: false,
tables: None,
retrying: false,
};
let mut state = DropDatabaseExecutor::new(
physical_table_id,
@@ -258,6 +281,7 @@ mod tests {
schema: DEFAULT_SCHEMA_NAME.to_string(),
drop_if_exists: false,
tables: None,
retrying: false,
};
let (state, status) = state.next(&ddl_context, &mut ctx).await.unwrap();
assert!(!status.need_persist());
@@ -270,6 +294,7 @@ mod tests {
schema: DEFAULT_SCHEMA_NAME.to_string(),
drop_if_exists: false,
tables: None,
retrying: false,
};
let mut state = DropDatabaseExecutor::new(
logical_table_id,
@@ -360,6 +385,7 @@ mod tests {
schema: DEFAULT_SCHEMA_NAME.to_string(),
drop_if_exists: false,
tables: None,
retrying: false,
};
let err = state.next(&ddl_context, &mut ctx).await.unwrap_err();
assert!(err.is_retry_later());
@@ -389,6 +415,7 @@ mod tests {
schema: DEFAULT_SCHEMA_NAME.to_string(),
drop_if_exists: false,
tables: None,
retrying: false,
};
state.recover(&ddl_context).unwrap();
assert_eq!(state.dropping_regions.len(), 1);
@@ -398,4 +425,73 @@ mod tests {
assert_eq!(cursor.target, DropTableTarget::Physical);
}
}
#[tokio::test]
async fn test_recover_registers_region_role_from_routes() {
let node_manager = Arc::new(MockDatanodeManager::new(NaiveDatanodeHandler));
let ddl_context = new_ddl_context(node_manager);
let region_id = RegionId::new(1024, 1);
let mut state = DropDatabaseExecutor::new(
1024,
1024,
TableName::new(DEFAULT_CATALOG_NAME, DEFAULT_SCHEMA_NAME, "phy"),
vec![RegionRoute {
region: Region::new_test(region_id),
leader_peer: Some(Peer::empty(7)),
follower_peers: vec![],
leader_state: Some(LeaderState::Downgrading),
leader_down_since: None,
write_route_policy: None,
}],
DropTableTarget::Physical,
);
state.recover(&ddl_context).unwrap();
let roles = ddl_context
.memory_region_keeper
.extract_operating_region_roles(7, &HashSet::from([region_id]));
assert_eq!(roles.get(&region_id), Some(&RegionRole::DowngradingLeader));
}
#[tokio::test]
async fn test_next_remaps_addresses_when_retrying() {
let (tx, mut rx) = tokio::sync::mpsc::channel(8);
let node_manager = Arc::new(MockDatanodeManager::new(DatanodeWatcher::new(tx)));
let ddl_context = new_ddl_context(node_manager);
let physical_table_id = create_physical_table(&ddl_context, "phy").await;
let (_, table_route) = ddl_context
.table_metadata_manager
.table_route_manager()
.get_physical_table_route(physical_table_id)
.await
.unwrap();
let mut state = DropDatabaseExecutor::new(
physical_table_id,
physical_table_id,
TableName::new(DEFAULT_CATALOG_NAME, DEFAULT_SCHEMA_NAME, "phy"),
table_route.region_routes,
DropTableTarget::Physical,
);
state.physical_region_routes[0]
.leader_peer
.as_mut()
.unwrap()
.addr = "old-addr".to_string();
let mut ctx = DropDatabaseContext {
catalog: DEFAULT_CATALOG_NAME.to_string(),
schema: DEFAULT_SCHEMA_NAME.to_string(),
drop_if_exists: false,
tables: None,
retrying: true,
};
put_datanode_address(&ddl_context, 0, "new-addr").await;
state.next(&ddl_context, &mut ctx).await.unwrap();
let (peer, _) = rx.try_recv().unwrap();
assert_eq!(peer.addr, "new-addr");
}
}

@@ -122,6 +122,7 @@ mod tests {
schema: "bar".to_string(),
drop_if_exists: true,
tables: None,
retrying: false,
};
let (state, status) = state.next(&ddl_context, &mut ctx).await.unwrap();
state
@@ -150,6 +151,7 @@ mod tests {
schema: "bar".to_string(),
drop_if_exists: true,
tables: None,
retrying: false,
};
let (state, status) = state.next(&ddl_context, &mut ctx).await.unwrap();
state

@@ -93,6 +93,7 @@ mod tests {
schema: "bar".to_string(),
drop_if_exists: false,
tables: None,
retrying: false,
};
let err = step.next(&ddl_context, &mut ctx).await.unwrap_err();
assert_matches!(err, error::Error::SchemaNotFound { .. });
@@ -108,6 +109,7 @@ mod tests {
schema: "bar".to_string(),
drop_if_exists: true,
tables: None,
retrying: false,
};
let (state, status) = state.next(&ddl_context, &mut ctx).await.unwrap();
state.as_any().downcast_ref::<DropDatabaseEnd>().unwrap();
@@ -130,6 +132,7 @@ mod tests {
schema: "bar".to_string(),
drop_if_exists: false,
tables: None,
retrying: false,
};
let (state, status) = state.next(&ddl_context, &mut ctx).await.unwrap();
state.as_any().downcast_ref::<DropDatabaseCursor>().unwrap();

@@ -43,7 +43,7 @@ use crate::lock_key::{CatalogLock, SchemaLock, TableLock};
use crate::metrics;
use crate::region_keeper::OperatingRegionGuard;
use crate::rpc::ddl::DropTableTask;
use crate::rpc::router::{RegionRoute, operating_leader_regions};
use crate::rpc::router::{RegionRoute, operating_leader_region_roles};
pub struct DropTableProcedure {
/// The context of procedure runtime.
@@ -94,7 +94,7 @@ impl DropTableProcedure {
/// Registers dropping regions if they are not registered yet.
fn register_dropping_regions(&mut self) -> Result<()> {
let dropping_regions = operating_leader_regions(&self.data.physical_region_routes);
let dropping_regions = operating_leader_region_roles(&self.data.physical_region_routes);
if !self.dropping_regions.is_empty() {
return Ok(());
@@ -102,11 +102,11 @@ impl DropTableProcedure {
let mut dropping_region_guards = Vec::with_capacity(dropping_regions.len());
for (region_id, datanode_id) in dropping_regions {
for (region_id, datanode_id, role) in dropping_regions {
let guard = self
.context
.memory_region_keeper
.register(datanode_id, region_id)
.register_with_role(datanode_id, region_id, role)
.context(error::RegionOperatingRaceSnafu {
region_id,
peer_id: datanode_id,
@@ -154,7 +154,24 @@ impl DropTableProcedure {
Ok(Status::executing(true))
}
pub async fn on_datanode_drop_regions(&mut self) -> Result<Status> {
pub async fn on_datanode_drop_regions(&mut self, retrying: bool) -> Result<Status> {
if retrying {
info!(
"Remapping region routes addresses for retrying drop regions for table_id: {}",
self.data.table_id()
);
let storage = self
.context
.table_metadata_manager
.table_route_manager()
.table_route_storage();
// The peer addresses may change during retries,
// so we always remap the region routes.
storage
.remap_region_routes(&mut self.data.physical_region_routes)
.await?;
}
self.executor
.on_drop_regions(
&self.context.node_manager,
@@ -215,7 +232,7 @@ impl Procedure for DropTableProcedure {
Ok(())
}
async fn execute(&mut self, _ctx: &ProcedureContext) -> ProcedureResult<Status> {
async fn execute(&mut self, ctx: &ProcedureContext) -> ProcedureResult<Status> {
let state = &self.data.state;
let _timer = metrics::METRIC_META_PROCEDURE_DROP_TABLE
.with_label_values(&[state.as_ref()])
@@ -225,7 +242,10 @@ impl Procedure for DropTableProcedure {
DropTableState::Prepare => self.on_prepare().await,
DropTableState::DeleteMetadata => self.on_delete_metadata().await,
DropTableState::InvalidateTableCache => self.on_broadcast().await,
DropTableState::DatanodeDropRegions => self.on_datanode_drop_regions().await,
DropTableState::DatanodeDropRegions => {
let retrying = ctx.is_retrying().await.unwrap_or(false);
self.on_datanode_drop_regions(retrying).await
}
DropTableState::DeleteTombstone => self.on_delete_metadata_tombstone().await,
}
.map_err(map_to_procedure_error)

@@ -41,8 +41,12 @@ use crate::ddl::test_util::create_table::{
TestCreateTableExprBuilder, build_raw_table_info_from_expr,
};
use crate::ddl::{DdlContext, TableMetadata};
use crate::key::node_address::{NodeAddressKey, NodeAddressValue};
use crate::key::table_route::TableRouteValue;
use crate::key::{MetadataKey, MetadataValue};
use crate::peer::Peer;
use crate::rpc::ddl::CreateTableTask;
use crate::rpc::store::PutRequest;
pub async fn create_physical_table_metadata(
ddl_context: &DdlContext,
@@ -56,6 +60,21 @@ pub async fn create_physical_table_metadata(
.unwrap();
}
pub async fn put_datanode_address(ddl_context: &DdlContext, node_id: u64, addr: &str) {
ddl_context
.table_metadata_manager
.kv_backend()
.put(PutRequest {
key: NodeAddressKey::with_datanode(node_id).to_bytes(),
value: NodeAddressValue::new(Peer::new(node_id, addr))
.try_as_raw_value()
.unwrap(),
..Default::default()
})
.await
.unwrap();
}
pub async fn create_physical_table(ddl_context: &DdlContext, name: &str) -> TableId {
// Prepares physical table metadata.
let mut create_physical_table_task = test_create_physical_table_task(name);

@@ -13,7 +13,7 @@
// limitations under the License.
use std::assert_matches;
use std::collections::HashMap;
use std::collections::{HashMap, HashSet};
use std::sync::Arc;
use api::region::RegionResponse;
@@ -30,6 +30,7 @@ use datatypes::prelude::ConcreteDataType;
use datatypes::schema::ColumnSchema;
use store_api::metadata::ColumnMetadata;
use store_api::metric_engine_consts::TABLE_COLUMN_METADATA_EXTENSION_KEY;
use store_api::region_engine::RegionRole;
use store_api::storage::RegionId;
use tokio::sync::mpsc;
@@ -42,7 +43,7 @@ use crate::ddl::test_util::datanode_handler::{
DatanodeWatcher, NaiveDatanodeHandler, RetryErrorDatanodeHandler,
UnexpectedErrorDatanodeHandler,
};
use crate::ddl::test_util::{assert_column_name, get_raw_table_info};
use crate::ddl::test_util::{assert_column_name, get_raw_table_info, put_datanode_address};
use crate::error::{Error, Result};
use crate::key::table_route::TableRouteValue;
use crate::kv_backend::memory::MemoryKvBackend;
@@ -244,6 +245,27 @@ async fn test_on_datanode_create_regions_should_not_retry() {
assert!(!error.is_retry_later());
}
#[tokio::test]
async fn test_on_datanode_create_regions_remaps_addresses_when_retrying() {
let (tx, mut rx) = mpsc::channel(8);
let datanode_handler = DatanodeWatcher::new(tx).with_handler(create_request_handler);
let node_manager = Arc::new(MockDatanodeManager::new(datanode_handler));
let ddl_context = new_ddl_context(node_manager);
let task = test_create_table_task("foo");
let mut procedure = CreateTableProcedure::new(task, ddl_context.clone()).unwrap();
procedure.on_prepare().await.unwrap();
let table_route = procedure.data.table_route.as_mut().unwrap();
let leader = table_route.region_routes[0].leader_peer.as_mut().unwrap();
leader.addr = "old-addr".to_string();
put_datanode_address(&ddl_context, leader.id, "new-addr").await;
procedure.on_datanode_create_regions(true).await.unwrap();
let (peer, _) = rx.try_recv().unwrap();
assert_eq!(peer.addr, "new-addr");
}
#[tokio::test]
async fn test_on_create_metadata_error() {
common_telemetry::init_default_ut_logging();
@@ -330,6 +352,10 @@ async fn test_memory_region_keeper_guard_dropped_on_procedure_done() {
.memory_region_keeper
.contains(datanode_id, region_id)
);
let roles = ddl_context
.memory_region_keeper
.extract_operating_region_roles(datanode_id, &HashSet::from([region_id]));
assert_eq!(roles.get(&region_id), Some(&RegionRole::Leader));
execute_procedure_until_done(&mut procedure).await;

@@ -12,7 +12,7 @@
// See the License for the specific language governing permissions and
// limitations under the License.
use std::collections::HashMap;
use std::collections::{HashMap, HashSet};
use std::sync::Arc;
use api::v1::region::{RegionRequest, region_request};
@@ -23,6 +23,7 @@ use common_procedure::Procedure;
use common_procedure_test::{
execute_procedure_until, execute_procedure_until_done, new_test_procedure_context,
};
use store_api::region_engine::RegionRole;
use store_api::storage::RegionId;
use table::metadata::TableId;
use tokio::sync::mpsc;
@@ -34,7 +35,7 @@ use crate::ddl::test_util::create_table::test_create_table_task;
use crate::ddl::test_util::datanode_handler::{DatanodeWatcher, NaiveDatanodeHandler};
use crate::ddl::test_util::{
create_logical_table, create_physical_table, create_physical_table_metadata,
test_create_logical_table_task, test_create_physical_table_task,
put_datanode_address, test_create_logical_table_task, test_create_physical_table_task,
};
use crate::key::table_route::TableRouteValue;
use crate::kv_backend::memory::MemoryKvBackend;
@@ -146,7 +147,7 @@ async fn test_on_datanode_drop_regions() {
// Drop table
let mut procedure = DropTableProcedure::new(task, ddl_context);
procedure.on_prepare().await.unwrap();
procedure.on_datanode_drop_regions().await.unwrap();
procedure.on_datanode_drop_regions(false).await.unwrap();
let check = |peer: Peer,
request: RegionRequest,
@@ -186,6 +187,50 @@ async fn test_on_datanode_drop_regions() {
check(peer, request, 5, RegionId::new(table_id, 1), true);
}
#[tokio::test]
async fn test_on_datanode_drop_regions_remaps_addresses_when_retrying() {
let (tx, mut rx) = mpsc::channel(8);
let datanode_handler = DatanodeWatcher::new(tx);
let node_manager = Arc::new(MockDatanodeManager::new(datanode_handler));
let ddl_context = new_ddl_context(node_manager);
let table_id = 1024;
let table_name = "foo";
let task = test_create_table_task(table_name, table_id);
ddl_context
.table_metadata_manager
.create_table_metadata(
task.table_info.clone(),
TableRouteValue::physical(vec![RegionRoute {
region: Region::new_test(RegionId::new(table_id, 1)),
leader_peer: Some(Peer::new(1, "old-leader")),
follower_peers: vec![Peer::new(5, "old-follower")],
leader_state: None,
leader_down_since: None,
write_route_policy: None,
}]),
HashMap::new(),
)
.await
.unwrap();
let task = new_drop_table_task(table_name, table_id, false);
let mut procedure = DropTableProcedure::new(task, ddl_context.clone());
procedure.on_prepare().await.unwrap();
put_datanode_address(&ddl_context, 1, "new-leader").await;
put_datanode_address(&ddl_context, 5, "new-follower").await;
procedure.on_datanode_drop_regions(true).await.unwrap();
let mut peers = Vec::new();
for _ in 0..2 {
peers.push(rx.try_recv().unwrap().0);
}
peers.sort_unstable_by_key(|p| p.id);
assert_eq!(peers[0].addr, "new-leader");
assert_eq!(peers[1].addr, "new-follower");
}
#[tokio::test]
async fn test_on_rollback() {
let node_manager = Arc::new(MockDatanodeManager::new(NaiveDatanodeHandler));
@@ -284,6 +329,10 @@ async fn test_memory_region_keeper_guard_dropped_on_procedure_done() {
.memory_region_keeper
.contains(datanode_id, region_id)
);
let roles = ddl_context
.memory_region_keeper
.extract_operating_region_roles(datanode_id, &HashSet::from([region_id]));
assert_eq!(roles.get(&region_id), Some(&RegionRole::Leader));
execute_procedure_until_done(&mut procedure).await;

@@ -42,7 +42,7 @@ use crate::key::table_name::TableNameKey;
use crate::lock_key::{CatalogLock, SchemaLock, TableLock};
use crate::metrics;
use crate::rpc::ddl::TruncateTableTask;
use crate::rpc::router::{RegionRoute, find_leader_regions, find_leaders};
use crate::rpc::router::{find_leader_regions, find_leaders};
pub struct TruncateTableProcedure {
context: DdlContext,
@@ -94,12 +94,11 @@ impl TruncateTableProcedure {
pub(crate) fn new(
task: TruncateTableTask,
table_info_value: DeserializedValueWithBytes<TableInfoValue>,
region_routes: Vec<RegionRoute>,
context: DdlContext,
) -> Self {
Self {
context,
data: TruncateTableData::new(task, table_info_value, region_routes),
data: TruncateTableData::new(task, table_info_value),
}
}
@@ -138,13 +137,18 @@ impl TruncateTableProcedure {
async fn on_datanode_truncate_regions(&mut self) -> Result<Status> {
let table_id = self.data.table_id();
let region_routes = &self.data.region_routes;
let leaders = find_leaders(region_routes);
let (_, physical_table_route) = self
.context
.table_metadata_manager
.table_route_manager()
.get_physical_table_route(table_id)
.await?;
let leaders = find_leaders(&physical_table_route.region_routes);
let mut truncate_region_tasks = Vec::with_capacity(leaders.len());
for datanode in leaders {
let requester = self.context.node_manager.datanode(&datanode).await;
let regions = find_leader_regions(region_routes, &datanode);
let regions = find_leader_regions(&physical_table_route.region_routes, &datanode);
for region in regions {
let region_id = RegionId::new(table_id, region);
@@ -201,20 +205,17 @@ pub struct TruncateTableData {
state: TruncateTableState,
task: TruncateTableTask,
table_info_value: DeserializedValueWithBytes<TableInfoValue>,
region_routes: Vec<RegionRoute>,
}
impl TruncateTableData {
pub fn new(
task: TruncateTableTask,
table_info_value: DeserializedValueWithBytes<TableInfoValue>,
region_routes: Vec<RegionRoute>,
) -> Self {
Self {
state: TruncateTableState::Prepare,
task,
table_info_value,
region_routes,
}
}

@@ -45,7 +45,7 @@ use crate::ddl::drop_view::DropViewProcedure;
use crate::ddl::truncate_table::TruncateTableProcedure;
use crate::ddl::{DdlContext, utils};
use crate::error::{
CreateRepartitionProcedureSnafu, EmptyDdlTasksSnafu, ProcedureOutputSnafu,
self, CreateRepartitionProcedureSnafu, EmptyDdlTasksSnafu, ProcedureOutputSnafu,
RegisterProcedureLoaderSnafu, RegisterRepartitionProcedureLoaderSnafu, Result,
SubmitProcedureSnafu, TableInfoNotFoundSnafu, TableNotFoundSnafu, TableRouteNotFoundSnafu,
UnexpectedLogicalRouteTableSnafu, WaitProcedureSnafu,
@@ -72,7 +72,6 @@ use crate::rpc::ddl::{
CreateTableTask, CreateViewTask, DropDatabaseTask, DropFlowTask, DropTableTask, DropViewTask,
QueryContext, SubmitDdlTaskRequest, SubmitDdlTaskResponse, TruncateTableTask,
};
use crate::rpc::router::RegionRoute;
/// A configurator that customizes or enhances a [`DdlManager`].
#[async_trait::async_trait]
@@ -521,15 +520,9 @@ impl DdlManager {
&self,
truncate_table_task: TruncateTableTask,
table_info_value: DeserializedValueWithBytes<TableInfoValue>,
region_routes: Vec<RegionRoute>,
) -> Result<(ProcedureId, Option<Output>)> {
let context = self.create_context();
let procedure = TruncateTableProcedure::new(
truncate_table_task,
table_info_value,
region_routes,
context,
);
let procedure = TruncateTableProcedure::new(truncate_table_task, table_info_value, context);
let procedure_with_id = ProcedureWithId::with_random_id(Box::new(procedure));
@@ -658,19 +651,26 @@ async fn handle_truncate_table_task(
let table_metadata_manager = &ddl_manager.table_metadata_manager();
let table_ref = truncate_table_task.table_ref();
let (table_info_value, table_route_value) =
table_metadata_manager.get_full_table_info(table_id).await?;
let table_info_value = table_info_value.with_context(|| TableInfoNotFoundSnafu {
table: table_ref.to_string(),
})?;
let table_route_value = table_route_value.context(TableRouteNotFoundSnafu { table_id })?;
let table_route = table_route_value.into_inner().region_routes()?.clone();
let table_info_value = table_metadata_manager
.table_info_manager()
.get(table_id)
.await?
.with_context(|| TableInfoNotFoundSnafu {
table: table_ref.to_string(),
})?;
let physical_table_id = table_metadata_manager
.table_route_manager()
.get_physical_table_id(table_id)
.await?;
ensure!(
physical_table_id == table_id,
error::UnexpectedSnafu {
err_msg: "Truncate table is only supported for physical tables."
}
);
let (id, _) = ddl_manager
.submit_truncate_table_task(truncate_table_task, table_info_value, table_route)
.submit_truncate_table_task(truncate_table_task, table_info_value)
.await?;
info!("Table: {table_id} is truncated via procedure_id {id:?}");

@@ -663,7 +663,7 @@ impl TableMetadataManager {
if let Some(table_route_value) = &mut table_route_value {
self.table_route_manager()
.table_route_storage()
.remap_route_address(table_route_value)
.remap_table_route(table_route_value)
.await?;
}
Ok((table_info_value, table_route_value))

@@ -675,7 +675,7 @@ impl TableRouteStorage {
pub async fn get(&self, table_id: TableId) -> Result<Option<TableRouteValue>> {
let mut table_route = self.get_inner(table_id).await?;
if let Some(table_route) = &mut table_route {
self.remap_route_address(table_route).await?;
self.remap_table_route(table_route).await?;
};
Ok(table_route)
@@ -697,7 +697,7 @@ impl TableRouteStorage {
) -> Result<Option<DeserializedValueWithBytes<TableRouteValue>>> {
let mut table_route = self.get_with_raw_bytes_inner(table_id).await?;
if let Some(table_route) = &mut table_route {
self.remap_route_address(table_route).await?;
self.remap_table_route(table_route).await?;
};
Ok(table_route)
@@ -791,10 +791,7 @@ impl TableRouteStorage {
Ok(())
}
pub(crate) async fn remap_route_address(
&self,
table_route: &mut TableRouteValue,
) -> Result<()> {
pub(crate) async fn remap_table_route(&self, table_route: &mut TableRouteValue) -> Result<()> {
let keys = extract_address_keys(table_route).into_iter().collect();
let node_addrs = self.get_node_addresses(keys).await?;
set_addresses(&node_addrs, table_route)?;
@@ -802,6 +799,17 @@ impl TableRouteStorage {
Ok(())
}
pub(crate) async fn remap_region_routes(
&self,
region_routes: &mut [RegionRoute],
) -> Result<()> {
let keys = extract_address_keys_from_region_routes(region_routes)
.into_iter()
.collect();
let node_addrs = self.get_node_addresses(keys).await?;
set_addresses_for_region_routes(&node_addrs, region_routes)
}
async fn get_node_addresses(
&self,
keys: Vec<Vec<u8>>,
@@ -824,15 +832,11 @@ impl TableRouteStorage {
}
}
fn set_addresses(
fn set_addresses_for_region_routes(
node_addrs: &HashMap<u64, NodeAddressValue>,
table_route: &mut TableRouteValue,
region_routes: &mut [RegionRoute],
) -> Result<()> {
let TableRouteValue::Physical(physical_table_route) = table_route else {
return Ok(());
};
for region_route in &mut physical_table_route.region_routes {
for region_route in region_routes {
if let Some(leader) = &mut region_route.leader_peer
&& let Some(node_addr) = node_addrs.get(&leader.id)
{
@@ -848,13 +852,18 @@ fn set_addresses(
Ok(())
}
fn extract_address_keys(table_route: &TableRouteValue) -> HashSet<Vec<u8>> {
fn set_addresses(
node_addrs: &HashMap<u64, NodeAddressValue>,
table_route: &mut TableRouteValue,
) -> Result<()> {
let TableRouteValue::Physical(physical_table_route) = table_route else {
return HashSet::default();
return Ok(());
};
set_addresses_for_region_routes(node_addrs, &mut physical_table_route.region_routes)
}
physical_table_route
.region_routes
fn extract_address_keys_from_region_routes(region_routes: &[RegionRoute]) -> HashSet<Vec<u8>> {
region_routes
.iter()
.flat_map(|region_route| {
region_route
@@ -871,6 +880,14 @@ fn extract_address_keys(table_route: &TableRouteValue) -> HashSet<Vec<u8>> {
.collect()
}
fn extract_address_keys(table_route: &TableRouteValue) -> HashSet<Vec<u8>> {
let TableRouteValue::Physical(physical_table_route) = table_route else {
return HashSet::default();
};
extract_address_keys_from_region_routes(&physical_table_route.region_routes)
}
#[cfg(test)]
mod tests {
use std::sync::Arc;
@@ -1104,7 +1121,7 @@ mod tests {
.unwrap();
table_route_storage
.remap_route_address(&mut table_route)
.remap_table_route(&mut table_route)
.await
.unwrap();
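
The refactor above splits address remapping into a slice-level helper and a thin wrapper that only acts on physical table routes. The following is a minimal standalone sketch of that shape, not the crate's code: `Peer`, `RegionRoute`, and `TableRoute` are simplified stand-ins, the address lookup is modelled as a plain `HashMap<u64, String>` instead of the kv backend, and `peer_ids` is a hypothetical counterpart to the key-extraction step. It assumes peers without a stored address keep their current one.

```rust
use std::collections::{HashMap, HashSet};

/// Simplified stand-in for a datanode peer (not the crate's `Peer`).
#[derive(Debug, Clone, PartialEq)]
pub struct Peer {
    pub id: u64,
    pub addr: String,
}

/// Simplified stand-in for a region route: one optional leader plus followers.
#[derive(Debug, Clone, PartialEq)]
pub struct RegionRoute {
    pub leader_peer: Option<Peer>,
    pub follower_peers: Vec<Peer>,
}

/// Simplified stand-in for a table route value.
#[derive(Debug, Clone, PartialEq)]
pub enum TableRoute {
    Physical(Vec<RegionRoute>),
    Logical,
}

/// Collects the ids of every peer (leader or follower) referenced by the routes.
pub fn peer_ids(region_routes: &[RegionRoute]) -> HashSet<u64> {
    region_routes
        .iter()
        .flat_map(|route| route.leader_peer.iter().chain(route.follower_peers.iter()))
        .map(|peer| peer.id)
        .collect()
}

/// Slice-level helper: overwrites peer addresses in place.
/// Assumption: a peer whose id is missing from `node_addrs` is left untouched.
pub fn remap_region_routes(node_addrs: &HashMap<u64, String>, region_routes: &mut [RegionRoute]) {
    for route in region_routes.iter_mut() {
        let peers = route
            .leader_peer
            .iter_mut()
            .chain(route.follower_peers.iter_mut());
        for peer in peers {
            if let Some(addr) = node_addrs.get(&peer.id) {
                peer.addr = addr.clone();
            }
        }
    }
}

/// Wrapper: only physical routes carry peers, so logical routes are a no-op.
pub fn remap_table_route(node_addrs: &HashMap<u64, String>, table_route: &mut TableRoute) {
    let TableRoute::Physical(region_routes) = table_route else {
        return;
    };
    remap_region_routes(node_addrs, region_routes);
}
```

Keeping the slice-level helper separate lets a procedure that has already cached a `Vec<RegionRoute>` refresh addresses on retry without rebuilding a full table route value.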

View File

@@ -15,14 +15,18 @@
use std::marker::PhantomData;
use std::sync::Arc;
use common_telemetry::debug;
use snafu::ResultExt;
use common_telemetry::{debug, info, warn};
use lazy_static::lazy_static;
use regex::Regex;
use snafu::{OptionExt, ResultExt};
use sqlx::mysql::MySqlRow;
use sqlx::pool::Pool;
use sqlx::{MySql, MySqlPool, Row, Transaction as MySqlTransaction};
use strum::AsRefStr;
use crate::error::{CreateMySqlPoolSnafu, MySqlExecutionSnafu, MySqlTransactionSnafu, Result};
use crate::error::{
CreateMySqlPoolSnafu, MySqlExecutionSnafu, MySqlTransactionSnafu, Result, UnexpectedSnafu,
};
use crate::kv_backend::KvBackendRef;
use crate::kv_backend::rds::{
Executor, ExecutorFactory, ExecutorImpl, KvQueryExecutor, RDS_STORE_OP_BATCH_DELETE,
@@ -37,6 +41,18 @@ use crate::rpc::store::{
const MYSQL_STORE_NAME: &str = "mysql_store";
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ValueBlobType {
Blob,
MediumBlob,
LongBlob,
}
lazy_static! {
static ref VALUE_COLUMN_BLOB_TYPE_RE: Regex =
Regex::new(r#"(?i)(?:\(|,)\s*[`"]?v[`"]?\s+(longblob|mediumblob|blob)\b"#).unwrap();
}
type MySqlClient = Arc<Pool<MySql>>;
pub struct MySqlTxnClient(MySqlTransaction<'static, MySql>);
@@ -161,7 +177,11 @@ impl<'a> MySqlTemplateFactory<'a> {
table_name: table_name.to_string(),
create_table_statement: format!(
// Cannot be more than 3072 bytes in PRIMARY KEY
"CREATE TABLE IF NOT EXISTS `{table_name}`(k VARBINARY(3072) PRIMARY KEY, v BLOB);",
"CREATE TABLE IF NOT EXISTS `{table_name}`(k VARBINARY(3072) PRIMARY KEY, v MEDIUMBLOB);",
),
show_create_table_statement: format!("SHOW CREATE TABLE `{table_name}`"),
alter_value_column_statement: format!(
"ALTER TABLE `{table_name}` MODIFY COLUMN v MEDIUMBLOB;"
),
range_template: RangeTemplate {
point: format!("SELECT k, v FROM `{table_name}` WHERE k = ?"),
@@ -186,6 +206,8 @@ impl<'a> MySqlTemplateFactory<'a> {
pub struct MySqlTemplateSet {
table_name: String,
create_table_statement: String,
show_create_table_statement: String,
alter_value_column_statement: String,
range_template: RangeTemplate,
delete_template: RangeTemplate,
}
@@ -534,6 +556,68 @@ impl KvQueryExecutor<MySqlClient> for MySqlStore {
}
impl MySqlStore {
/// Reads the current table definition for best-effort schema upgrades.
async fn fetch_create_table_sql(
pool: &Pool<MySql>,
sql_template_set: &MySqlTemplateSet,
) -> Result<Option<String>> {
let row = sqlx::query(&sql_template_set.show_create_table_statement)
.fetch_optional(pool)
.await
.with_context(|_| MySqlExecutionSnafu {
sql: sql_template_set.show_create_table_statement.clone(),
})?;
Ok(row.map(|row| row.get(1)))
}
/// Parses the blob type of the `v` column from `SHOW CREATE TABLE` output.
fn parse_value_column_blob_type(create_table_sql: &str) -> Option<ValueBlobType> {
// `SHOW CREATE TABLE` returns MySQL-specific DDL. A minimal parser keeps the
// upgrade check small and avoids introducing a SQL parser just for one column.
let captures = VALUE_COLUMN_BLOB_TYPE_RE.captures(create_table_sql)?;
match captures.get(1)?.as_str().to_ascii_lowercase().as_str() {
"blob" => Some(ValueBlobType::Blob),
"mediumblob" => Some(ValueBlobType::MediumBlob),
"longblob" => Some(ValueBlobType::LongBlob),
_ => None,
}
}
/// Upgrades the metadata value column to `MEDIUMBLOB` when an old table still uses `BLOB`.
async fn maybe_upgrade_value_column_to_mediumblob(
pool: &Pool<MySql>,
sql_template_set: &MySqlTemplateSet,
) -> Result<()> {
let table_name = &sql_template_set.table_name;
let create_table_sql = Self::fetch_create_table_sql(pool, sql_template_set)
.await?
.context(UnexpectedSnafu {
err_msg: format!("Failed to fetch CREATE TABLE SQL for `{table_name}`"),
})?;
match Self::parse_value_column_blob_type(&create_table_sql) {
Some(ValueBlobType::Blob) => {
sqlx::query(&sql_template_set.alter_value_column_statement)
.execute(pool)
.await
.with_context(|_| MySqlExecutionSnafu {
sql: sql_template_set.alter_value_column_statement.clone(),
})?;
info!("Upgraded MySQL metadata value column to MEDIUMBLOB for `{table_name}`");
}
Some(ValueBlobType::MediumBlob | ValueBlobType::LongBlob) => {
debug!("MySQL metadata value column for `{table_name}` is already compatible");
}
None => {
warn!(
"Failed to determine MySQL metadata value column type from table definition for `{table_name}`, skip automatic MEDIUMBLOB upgrade"
);
}
}
Ok(())
}
/// Create [MySqlStore] impl of [KvBackendRef] from url.
pub async fn with_url(url: &str, table_name: &str, max_txn_ops: usize) -> Result<KvBackendRef> {
let pool = MySqlPool::connect(url)
@@ -558,6 +642,7 @@ impl MySqlStore {
.with_context(|_| MySqlExecutionSnafu {
sql: sql_template_set.create_table_statement.clone(),
})?;
Self::maybe_upgrade_value_column_to_mediumblob(&pool, &sql_template_set).await?;
Ok(Arc::new(MySqlStore {
max_txn_ops,
sql_template_set,
@@ -574,6 +659,7 @@ impl MySqlStore {
mod tests {
use common_telemetry::init_default_ut_logging;
use sqlx::mysql::{MySqlConnectOptions, MySqlSslMode};
use uuid::Uuid;
use super::*;
use crate::kv_backend::test::{
@@ -585,15 +671,45 @@ mod tests {
text_txn_multi_compare_op, unprepare_kv,
};
use crate::maybe_skip_mysql_integration_test;
use crate::rpc::store::{PutRequest, RangeRequest};
use crate::test_util::test_certs_dir;
async fn build_mysql_kv_backend(table_name: &str) -> Option<MySqlStore> {
fn new_test_table_name(prefix: &str) -> String {
let uuid = Uuid::new_v4().simple().to_string();
let max_prefix_len = 63usize.saturating_sub(uuid.len() + 1);
let prefix = &prefix[..prefix.len().min(max_prefix_len)];
format!("{prefix}_{uuid}")
}
async fn mysql_pool() -> Option<MySqlPool> {
init_default_ut_logging();
let endpoints = std::env::var("GT_MYSQL_ENDPOINTS").unwrap_or_default();
if endpoints.is_empty() {
return None;
}
let pool = MySqlPool::connect(&endpoints).await.unwrap();
Some(MySqlPool::connect(&endpoints).await.unwrap())
}
async fn show_create_table(pool: &MySqlPool, table_name: &str) -> String {
let sql = format!("SHOW CREATE TABLE `{table_name}`");
let row = sqlx::query(&sql).fetch_one(pool).await.unwrap();
row.get::<String, _>(1)
}
async fn create_legacy_blob_table(pool: &MySqlPool, table_name: &str) {
let sql = format!(
"CREATE TABLE IF NOT EXISTS `{table_name}`(k VARBINARY(3072) PRIMARY KEY, v BLOB);"
);
sqlx::query(&sql).execute(pool).await.unwrap();
}
async fn drop_table(pool: &MySqlPool, table_name: &str) {
let sql = format!("DROP TABLE IF EXISTS `{table_name}`;");
sqlx::query(&sql).execute(pool).await.unwrap();
}
async fn build_mysql_kv_backend(table_name: &str) -> Option<MySqlStore> {
let pool = mysql_pool().await?;
let sql_templates = MySqlTemplateFactory::new(table_name).build();
sqlx::query(&sql_templates.create_table_statement)
.execute(&pool)
@@ -610,6 +726,156 @@ mod tests {
})
}
#[test]
fn test_parse_value_column_blob_type() {
let sql = r#"CREATE TABLE `greptime_metakv` (
`k` varbinary(3072) NOT NULL,
`v` MEDIUMBLOB,
PRIMARY KEY (`k`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4"#;
assert_eq!(
Some(ValueBlobType::MediumBlob),
MySqlStore::parse_value_column_blob_type(sql)
);
let sql = r#"CREATE TABLE `greptime_metakv` (`k` varbinary(3072) NOT NULL, `v` blob, PRIMARY KEY (`k`))"#;
assert_eq!(
Some(ValueBlobType::Blob),
MySqlStore::parse_value_column_blob_type(sql)
);
let sql = r#"CREATE TABLE `greptime_metakv` (`k` varbinary(3072) NOT NULL, `v` longblob, PRIMARY KEY (`k`))"#;
assert_eq!(
Some(ValueBlobType::LongBlob),
MySqlStore::parse_value_column_blob_type(sql)
);
let sql = "CREATE TABLE `greptime_metakv` (`k` varbinary(3072) NOT NULL, `v` BLOB NOT NULL, PRIMARY KEY (`k`))";
assert_eq!(
Some(ValueBlobType::Blob),
MySqlStore::parse_value_column_blob_type(sql)
);
let sql = "CREATE TABLE `greptime_metakv` (\n `k` varbinary(3072) NOT NULL,\n \"v\" MediumBlob,\n PRIMARY KEY (`k`)\n)";
assert_eq!(
Some(ValueBlobType::MediumBlob),
MySqlStore::parse_value_column_blob_type(sql)
);
let sql = "CREATE TABLE `greptime_metakv` (`k` varbinary(3072) NOT NULL, vv blob, `v` longblob, PRIMARY KEY (`k`))";
assert_eq!(
Some(ValueBlobType::LongBlob),
MySqlStore::parse_value_column_blob_type(sql)
);
let sql = "CREATE TABLE `greptime_metakv` (`k` varbinary(3072) NOT NULL, value blob, PRIMARY KEY (`k`))";
assert_eq!(None, MySqlStore::parse_value_column_blob_type(sql));
let sql = "CREATE TABLE `greptime_metakv` (`k` varbinary(3072) NOT NULL, `v` varchar(255), PRIMARY KEY (`k`))";
assert_eq!(None, MySqlStore::parse_value_column_blob_type(sql));
let sql =
"CREATE TABLE `greptime_metakv` (`k` varbinary(3072) NOT NULL, PRIMARY KEY (`k`))";
assert_eq!(None, MySqlStore::parse_value_column_blob_type(sql));
}
#[tokio::test]
async fn test_mysql_new_metadata_table_uses_mediumblob() {
maybe_skip_mysql_integration_test!();
let pool = mysql_pool().await.unwrap();
let table_name = new_test_table_name("test_mysql_mediumblob_schema");
MySqlStore::with_mysql_pool(pool.clone(), &table_name, 128)
.await
.unwrap();
let create_table = show_create_table(&pool, &table_name).await;
assert!(create_table.to_ascii_uppercase().contains("MEDIUMBLOB"));
drop_table(&pool, &table_name).await;
}
#[tokio::test]
async fn test_mysql_legacy_blob_metadata_table_is_upgraded() {
maybe_skip_mysql_integration_test!();
let pool = mysql_pool().await.unwrap();
let table_name = new_test_table_name("test_mysql_legacy_blob_upgrade");
create_legacy_blob_table(&pool, &table_name).await;
MySqlStore::with_mysql_pool(pool.clone(), &table_name, 128)
.await
.unwrap();
let create_table = show_create_table(&pool, &table_name).await;
assert!(create_table.to_ascii_uppercase().contains("MEDIUMBLOB"));
drop_table(&pool, &table_name).await;
}
#[tokio::test]
async fn test_mysql_metadata_table_stores_large_values() {
maybe_skip_mysql_integration_test!();
let pool = mysql_pool().await.unwrap();
let table_name = new_test_table_name("test_mysql_large_metadata_value");
let kv_backend = MySqlStore::with_mysql_pool(pool.clone(), &table_name, 128)
.await
.unwrap();
let key = b"large-value".to_vec();
let value = vec![b'x'; 70 * 1024];
kv_backend
.put(
PutRequest::new()
.with_key(key.clone())
.with_value(value.clone()),
)
.await
.unwrap();
let response = kv_backend
.range(RangeRequest::new().with_key(key.clone()))
.await
.unwrap();
assert_eq!(1, response.kvs.len());
assert_eq!(key, response.kvs[0].key);
assert_eq!(value, response.kvs[0].value);
drop_table(&pool, &table_name).await;
}
#[tokio::test]
async fn test_mysql_upgraded_metadata_table_stores_large_values() {
maybe_skip_mysql_integration_test!();
let pool = mysql_pool().await.unwrap();
let table_name = new_test_table_name("test_mysql_upgraded_large_metadata_value");
create_legacy_blob_table(&pool, &table_name).await;
let kv_backend = MySqlStore::with_mysql_pool(pool.clone(), &table_name, 128)
.await
.unwrap();
let key = b"large-value".to_vec();
let value = vec![b'y'; 70 * 1024];
kv_backend
.put(
PutRequest::new()
.with_key(key.clone())
.with_value(value.clone()),
)
.await
.unwrap();
let response = kv_backend
.range(RangeRequest::new().with_key(key.clone()))
.await
.unwrap();
assert_eq!(1, response.kvs.len());
assert_eq!(key, response.kvs[0].key);
assert_eq!(value, response.kvs[0].value);
drop_table(&pool, &table_name).await;
}
#[tokio::test]
async fn test_mysql_put() {
maybe_skip_mysql_integration_test!();

@@ -12,9 +12,11 @@
// See the License for the specific language governing permissions and
// limitations under the License.
use std::collections::HashSet;
use std::collections::hash_map::Entry;
use std::collections::{HashMap, HashSet};
use std::sync::{Arc, RwLock};
use store_api::region_engine::RegionRole;
use store_api::storage::RegionId;
use crate::DatanodeId;
@@ -24,7 +26,7 @@ use crate::DatanodeId;
pub struct OperatingRegionGuard {
datanode_id: DatanodeId,
region_id: RegionId,
inner: Arc<RwLock<HashSet<(DatanodeId, RegionId)>>>,
inner: Arc<RwLock<HashMap<(DatanodeId, RegionId), RegionRole>>>,
}
impl Drop for OperatingRegionGuard {
@@ -50,7 +52,7 @@ pub type MemoryRegionKeeperRef = Arc<MemoryRegionKeeper>;
/// - Tracks the deleting regions after the corresponding metadata is deleted.
#[derive(Debug, Clone, Default)]
pub struct MemoryRegionKeeper {
inner: Arc<RwLock<HashSet<(DatanodeId, RegionId)>>>,
inner: Arc<RwLock<HashMap<(DatanodeId, RegionId), RegionRole>>>,
}
impl MemoryRegionKeeper {
@@ -59,40 +61,48 @@ impl MemoryRegionKeeper {
}
/// Returns [OperatingRegionGuard] if Region(`region_id`) on Peer(`datanode_id`) does not exist.
pub fn register(
pub fn register_with_role(
&self,
datanode_id: DatanodeId,
region_id: RegionId,
role: RegionRole,
) -> Option<OperatingRegionGuard> {
let mut inner = self.inner.write().unwrap();
if inner.insert((datanode_id, region_id)) {
Some(OperatingRegionGuard {
datanode_id,
region_id,
inner: self.inner.clone(),
})
} else {
None
match inner.entry((datanode_id, region_id)) {
Entry::Occupied(_) => None,
Entry::Vacant(vacant_entry) => {
vacant_entry.insert(role);
Some(OperatingRegionGuard {
datanode_id,
region_id,
inner: self.inner.clone(),
})
}
}
}
/// Returns true if the keeper contains a (`datanode_id`, `region_id`) tuple.
pub fn contains(&self, datanode_id: DatanodeId, region_id: RegionId) -> bool {
let inner = self.inner.read().unwrap();
inner.contains(&(datanode_id, region_id))
inner.contains_key(&(datanode_id, region_id))
}
/// Extracts all operating regions from `region_ids` and returns operating regions.
pub fn extract_operating_regions(
/// Extracts all operating regions with roles from `region_ids`.
pub fn extract_operating_region_roles(
&self,
datanode_id: DatanodeId,
region_ids: &mut HashSet<RegionId>,
) -> HashSet<RegionId> {
region_ids: &HashSet<RegionId>,
) -> HashMap<RegionId, RegionRole> {
let inner = self.inner.read().unwrap();
region_ids
.extract_if(|region_id| inner.contains(&(datanode_id, *region_id)))
.collect::<HashSet<_>>()
.iter()
.filter_map(|region_id| {
inner
.get(&(datanode_id, *region_id))
.map(|role| (*region_id, *role))
})
.collect()
}
/// Returns the number of elements in the tracking set.
@@ -115,8 +125,9 @@ impl MemoryRegionKeeper {
#[cfg(test)]
mod tests {
use std::collections::HashSet;
use std::collections::{HashMap, HashSet};
use store_api::region_engine::RegionRole;
use store_api::storage::RegionId;
use crate::region_keeper::MemoryRegionKeeper;
@@ -125,20 +136,43 @@ mod tests {
fn test_opening_region_keeper() {
let keeper = MemoryRegionKeeper::new();
let guard = keeper.register(1, RegionId::from_u64(1)).unwrap();
assert!(keeper.register(1, RegionId::from_u64(1)).is_none());
let guard2 = keeper.register(1, RegionId::from_u64(2)).unwrap();
let guard = keeper
.register_with_role(1, RegionId::from_u64(1), RegionRole::Leader)
.unwrap();
assert!(
keeper
.register_with_role(1, RegionId::from_u64(1), RegionRole::Leader)
.is_none()
);
let guard2 = keeper
.register_with_role(1, RegionId::from_u64(2), RegionRole::Follower)
.unwrap();
let mut regions = HashSet::from([
let regions = HashSet::from([
RegionId::from_u64(1),
RegionId::from_u64(2),
RegionId::from_u64(3),
]);
let output = keeper.extract_operating_regions(1, &mut regions);
let output = keeper.extract_operating_region_roles(1, &regions);
assert_eq!(output.len(), 2);
assert!(output.contains(&RegionId::from_u64(1)));
assert!(output.contains(&RegionId::from_u64(2)));
assert!(output.contains_key(&RegionId::from_u64(1)));
assert!(output.contains_key(&RegionId::from_u64(2)));
assert_eq!(keeper.len(), 2);
let regions = HashSet::from([
RegionId::from_u64(1),
RegionId::from_u64(2),
RegionId::from_u64(3),
]);
let output = keeper.extract_operating_region_roles(1, &regions);
assert_eq!(
output,
HashMap::from([
(RegionId::from_u64(1), RegionRole::Leader),
(RegionId::from_u64(2), RegionRole::Follower),
])
);
assert_eq!(keeper.len(), 2);
drop(guard);
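The change above swaps the `HashSet` for a `HashMap` keyed by `(DatanodeId, RegionId)` so that each operating region also carries its role. A minimal standalone sketch of the same pattern follows. `NodeId`, `RegionKey`, `Role`, `Guard`, `Keeper` and `roles_for` are hypothetical stand-ins rather than the crate's types, and the `Drop` body is an assumption because the diff does not show it.

```rust
use std::collections::hash_map::Entry;
use std::collections::{HashMap, HashSet};
use std::sync::{Arc, RwLock};

// Hypothetical stand-ins for `DatanodeId`, `RegionId` and `RegionRole`.
type NodeId = u64;
type RegionKey = u64;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Role {
    Leader,
    Follower,
}

type Shared = Arc<RwLock<HashMap<(NodeId, RegionKey), Role>>>;

/// Removes its entry from the shared map when dropped (assumed behavior).
pub struct Guard {
    key: (NodeId, RegionKey),
    inner: Shared,
}

impl Drop for Guard {
    fn drop(&mut self) {
        self.inner.write().unwrap().remove(&self.key);
    }
}

#[derive(Clone, Default)]
pub struct Keeper {
    inner: Shared,
}

impl Keeper {
    /// Registers `(node, region)` with `role`; returns `None` if it is already tracked.
    pub fn register_with_role(&self, node: NodeId, region: RegionKey, role: Role) -> Option<Guard> {
        let mut inner = self.inner.write().unwrap();
        match inner.entry((node, region)) {
            // An existing entry keeps its original role.
            Entry::Occupied(_) => None,
            Entry::Vacant(slot) => {
                slot.insert(role);
                Some(Guard {
                    key: (node, region),
                    inner: self.inner.clone(),
                })
            }
        }
    }

    /// Looks up the roles of tracked regions among `regions` without mutating the input.
    pub fn roles_for(&self, node: NodeId, regions: &HashSet<RegionKey>) -> HashMap<RegionKey, Role> {
        let inner = self.inner.read().unwrap();
        regions
            .iter()
            .filter_map(|r| inner.get(&(node, *r)).map(|role| (*r, *role)))
            .collect()
    }
}
```

Using the `Entry` API keeps the existence check and the insert under a single lookup, and taking `&HashSet` instead of `&mut HashSet` means the caller's set is no longer drained as a side effect.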

@@ -23,6 +23,7 @@ use derive_builder::Builder;
use serde::ser::SerializeSeq;
use serde::{Deserialize, Deserializer, Serialize, Serializer};
use snafu::OptionExt;
use store_api::region_engine::RegionRole;
use store_api::storage::{RegionId, RegionNumber};
use strum::AsRefStr;
use table::table_name::TableName;
@@ -99,6 +100,20 @@ pub fn operating_leader_regions(region_routes: &[RegionRoute]) -> Vec<(RegionId,
.collect::<Vec<_>>()
}
/// Returns the operating leader regions with corresponding [DatanodeId] and [RegionRole].
pub fn operating_leader_region_roles(
region_routes: &[RegionRoute],
) -> Vec<(RegionId, DatanodeId, RegionRole)> {
region_routes
.iter()
.filter_map(|route| {
let role = route.leader_region_role()?;
let leader = route.leader_peer.as_ref()?;
Some((route.region.id, leader.id, role))
})
.collect()
}
/// Returns the HashMap<[RegionNumber], &[Peer]>;
///
/// If the region doesn't have a leader peer, the [Region] will be omitted.
@@ -342,6 +357,19 @@ impl RegionRoute {
matches!(self.leader_state, Some(LeaderState::Staging))
}
/// Returns the role of the leader region.
pub fn leader_region_role(&self) -> Option<RegionRole> {
self.leader_peer.as_ref().map(|_| {
if self.is_leader_staging() {
RegionRole::StagingLeader
} else if self.is_leader_downgrading() {
RegionRole::DowngradingLeader
} else {
RegionRole::Leader
}
})
}
/// Marks the Leader [`Region`] as [`RegionState::Downgrading`].
///
/// We should downgrade a [`Region`] before deactivating it:
@@ -577,6 +605,17 @@ mod tests {
use super::*;
use crate::key::RegionRoleSet;
fn new_test_region_route(region_id: RegionId) -> RegionRoute {
RegionRoute {
region: Region::new_test(region_id),
leader_peer: Some(Peer::new(1, "a1")),
follower_peers: vec![Peer::new(2, "a2")],
leader_state: None,
leader_down_since: None,
write_route_policy: None,
}
}
#[test]
fn test_leader_is_downgraded() {
let mut region_route = RegionRoute {
@@ -740,6 +779,65 @@ mod tests {
assert!(!region_route.is_ignore_all_writes());
}
#[test]
fn test_leader_region_role_without_leader_peer_returns_none() {
let region_route = RegionRoute {
leader_peer: None,
..new_test_region_route(RegionId::new(1, 1))
};
assert_eq!(region_route.leader_region_role(), None);
}
#[test]
fn test_leader_region_role_variants() {
let normal = new_test_region_route(RegionId::new(1, 1));
let mut downgrading = new_test_region_route(RegionId::new(1, 2));
downgrading.leader_state = Some(LeaderState::Downgrading);
let mut staging = new_test_region_route(RegionId::new(1, 3));
staging.leader_state = Some(LeaderState::Staging);
assert_eq!(normal.leader_region_role(), Some(RegionRole::Leader));
assert_eq!(
downgrading.leader_region_role(),
Some(RegionRole::DowngradingLeader)
);
assert_eq!(
staging.leader_region_role(),
Some(RegionRole::StagingLeader)
);
}
#[test]
fn test_operating_leader_region_roles_returns_expected_roles() {
let no_leader_region = RegionRoute {
leader_peer: None,
..new_test_region_route(RegionId::new(1, 4))
};
let mut downgrading = new_test_region_route(RegionId::new(1, 2));
downgrading.leader_peer = Some(Peer::new(2, "a2"));
downgrading.leader_state = Some(LeaderState::Downgrading);
let mut staging = new_test_region_route(RegionId::new(1, 3));
staging.leader_peer = Some(Peer::new(3, "a3"));
staging.leader_state = Some(LeaderState::Staging);
let roles = operating_leader_region_roles(&[
new_test_region_route(RegionId::new(1, 1)),
downgrading,
staging,
no_leader_region,
]);
assert_eq!(
roles,
vec![
(RegionId::new(1, 1), 1, RegionRole::Leader),
(RegionId::new(1, 2), 2, RegionRole::DowngradingLeader),
(RegionId::new(1, 3), 3, RegionRole::StagingLeader),
]
);
}
#[test]
fn test_region_distribution() {
let region_routes = vec![
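The role derivation added in `leader_region_role` can be read as a small decision table: no leader peer yields `None`, otherwise the leader state selects the role. The sketch below restates that logic with simplified, hypothetical types (`Route` and a `u64` peer id stand in for `RegionRoute` and `Peer`). It assumes `LeaderState` has only the two variants that appear in this diff.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum LeaderState {
    Downgrading,
    Staging,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RegionRole {
    Leader,
    StagingLeader,
    DowngradingLeader,
}

// Hypothetical, trimmed-down route: only the fields the role derivation needs.
pub struct Route {
    pub leader_peer: Option<u64>,
    pub leader_state: Option<LeaderState>,
}

impl Route {
    /// Returns the leader's role, or `None` when the route has no leader peer.
    pub fn leader_region_role(&self) -> Option<RegionRole> {
        // Without a leader peer there is no leader role to report.
        let _leader = self.leader_peer?;
        Some(match self.leader_state {
            Some(LeaderState::Staging) => RegionRole::StagingLeader,
            Some(LeaderState::Downgrading) => RegionRole::DowngradingLeader,
            None => RegionRole::Leader,
        })
    }
}
```

A `match` on the state makes the mapping exhaustive, so adding a new `LeaderState` variant would produce a compile error here instead of silently falling through to `Leader`.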

@@ -27,3 +27,10 @@ pub type PluginOptionsSerializerRef = Arc<dyn PluginOptionsSerializer>;
pub trait PluginOptionsDeserializer<T: DeserializeOwned>: Send + Sync {
fn deserialize(&self, payload: &str) -> Result<T, serde_json::Error>;
}
/// A flag indicating that the plugins are running in standalone mode.
///
/// The standalone build and start process calls both `setup_frontend_plugins` and
/// `setup_datanode_plugins`, so this flag is added to the plugins to tell them that
/// they are running in standalone mode.
#[derive(Clone, Copy, Debug)]
pub struct StandaloneFlag;

@@ -1293,6 +1293,60 @@ mod tests {
.await;
}
#[tokio::test]
async fn test_retrying_state_visible_in_context_on_retry() {
let retrying_states = Arc::new(std::sync::Mutex::new(Vec::new()));
let captured = retrying_states.clone();
let mut times = 0;
let exec_fn = move |ctx: Context| {
times += 1;
let captured = captured.clone();
async move {
let is_retrying = ctx.is_retrying().await;
captured.lock().unwrap().push(is_retrying);
if times == 1 {
Err(Error::retry_later(MockError::new(StatusCode::Unexpected)))
} else {
Ok(Status::done())
}
}
.boxed()
};
let procedure = ProcedureAdapter {
data: "retrying_state".to_string(),
lock_key: LockKey::single_exclusive("catalog.schema.table"),
poison_keys: PoisonKeys::default(),
exec_fn,
rollback_fn: None,
};
let dir = create_temp_dir("retrying_state");
let meta = procedure.new_meta(ROOT_ID);
let object_store = test_util::new_object_store(&dir);
let procedure_store = Arc::new(ProcedureStore::from_object_store(object_store));
let mut runner = new_runner(meta.clone(), Box::new(procedure), procedure_store);
let ctx = context_with_provider(
meta.id,
runner.manager_ctx.clone() as Arc<dyn ContextProvider>,
);
runner
.manager_ctx
.procedures
.write()
.unwrap()
.insert(meta.id, runner.meta.clone());
runner.manager_ctx.start();
runner.execute_once(&ctx).await;
runner.execute_once(&ctx).await;
let states = retrying_states.lock().unwrap().clone();
assert_eq!(states, vec![Some(false), Some(true)]);
}
#[tokio::test(flavor = "multi_thread")]
async fn test_execute_on_retry_later_error_with_child() {
common_telemetry::init_default_ut_logging();

@@ -177,6 +177,18 @@ pub struct Context {
pub provider: ContextProviderRef,
}
impl Context {
/// Returns `Some(true)` if the current procedure state is retrying, or `None` if the state cannot be obtained.
pub async fn is_retrying(&self) -> Option<bool> {
self.provider
.procedure_state(self.procedure_id)
.await
.ok()
.flatten()
.map(|s| s.is_retrying())
}
}
/// A `Procedure` represents an operation or a set of operations to be performed step-by-step.
#[async_trait]
pub trait Procedure: Send {
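`Context::is_retrying` collapses a fallible, optional lookup into an `Option<bool>`. The synchronous sketch below shows how the `.ok().flatten().map(..)` chain behaves for each input shape. `State` and the free function `is_retrying` are hypothetical stand-ins, and the `Result<Option<_>, _>` shape of the lookup is inferred from the chain rather than shown in the diff.

```rust
// Hypothetical stand-in for the procedure state type.
#[derive(Debug)]
pub enum State {
    Running,
    Retrying,
}

impl State {
    pub fn is_retrying(&self) -> bool {
        matches!(self, State::Retrying)
    }
}

/// Maps a lookup result to `Some(bool)` when a state is known, `None` otherwise.
pub fn is_retrying(lookup: Result<Option<State>, String>) -> Option<bool> {
    lookup
        .ok() // a failed lookup becomes `None`
        .flatten() // a missing procedure becomes `None`
        .map(|s| s.is_retrying())
}
```

Note that a lookup error and an unknown procedure are indistinguishable to the caller in this shape; both yield `None`.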

@@ -503,6 +503,7 @@ mod test {
use mito2::config::MitoConfig;
use mito2::test_util::{CreateRequestBuilder, TestEnv};
use store_api::region_engine::RegionEngine;
use store_api::region_request::{EnterStagingRequest, StagingPartitionDirective};
use super::*;
use crate::tests::mock_region_server;
@@ -621,4 +622,141 @@ mod test {
> Instant::now() + Duration::from_millis(heartbeat_interval_millis * 4)
);
}
#[tokio::test(flavor = "multi_thread")]
async fn renew_staging_leader_keeps_region_in_staging() {
let mut region_server = mock_region_server();
let mut engine_env = TestEnv::with_prefix("region-alive-keeper-staging").await;
let engine = engine_env.create_engine(MitoConfig::default()).await;
let engine = Arc::new(engine);
region_server.register_engine(engine.clone());
let alive_keeper = Arc::new(RegionAliveKeeper::new(
region_server.clone(),
None,
Duration::from_millis(100),
));
let region_id = RegionId::new(1024, 2);
region_server
.handle_request(
region_id,
RegionRequest::Create(CreateRequestBuilder::new().build()),
)
.await
.unwrap();
region_server
.handle_request(
region_id,
RegionRequest::EnterStaging(EnterStagingRequest {
partition_directive: StagingPartitionDirective::RejectAllWrites,
}),
)
.await
.unwrap();
alive_keeper.register_region(region_id).await;
alive_keeper
.renew_region_leases(
&[GrantedRegion {
region_id: region_id.as_u64(),
role: api::v1::meta::RegionRole::StagingLeader.into(),
extensions: HashMap::new(),
}],
Instant::now() + Duration::from_millis(3000),
)
.await;
assert_eq!(engine.role(region_id).unwrap(), RegionRole::StagingLeader);
}
#[tokio::test(flavor = "multi_thread")]
async fn renew_staging_leader_exit_into_leader() {
common_telemetry::init_default_ut_logging();
let mut region_server = mock_region_server();
let mut engine_env = TestEnv::with_prefix("region-alive-keeper-staging-exit").await;
let engine = engine_env.create_engine(MitoConfig::default()).await;
let engine = Arc::new(engine);
region_server.register_engine(engine.clone());
let alive_keeper = Arc::new(RegionAliveKeeper::new(
region_server.clone(),
None,
Duration::from_millis(100),
));
let region_id = RegionId::new(1024, 2);
region_server
.handle_request(
region_id,
RegionRequest::Create(CreateRequestBuilder::new().build()),
)
.await
.unwrap();
region_server
.handle_request(
region_id,
RegionRequest::EnterStaging(EnterStagingRequest {
partition_directive: StagingPartitionDirective::RejectAllWrites,
}),
)
.await
.unwrap();
alive_keeper.register_region(region_id).await;
alive_keeper
.renew_region_leases(
&[GrantedRegion {
region_id: region_id.as_u64(),
role: api::v1::meta::RegionRole::Leader.into(),
extensions: HashMap::new(),
}],
Instant::now() + Duration::from_millis(3000),
)
.await;
tokio::time::sleep(Duration::from_millis(100)).await;
assert_eq!(engine.role(region_id).unwrap(), RegionRole::Leader);
}
#[tokio::test(flavor = "multi_thread")]
async fn renew_staging_leader_does_not_promote_normal_leader_into_staging() {
let mut region_server = mock_region_server();
let mut engine_env = TestEnv::with_prefix("region-alive-keeper-non-staging").await;
let engine = engine_env.create_engine(MitoConfig::default()).await;
let engine = Arc::new(engine);
region_server.register_engine(engine.clone());
let alive_keeper = Arc::new(RegionAliveKeeper::new(
region_server.clone(),
None,
Duration::from_millis(100),
));
let region_id = RegionId::new(1024, 4);
region_server
.handle_request(
region_id,
RegionRequest::Create(CreateRequestBuilder::new().build()),
)
.await
.unwrap();
region_server
.set_region_role(region_id, RegionRole::Leader)
.unwrap();
alive_keeper.register_region(region_id).await;
alive_keeper
.renew_region_leases(
&[GrantedRegion {
region_id: region_id.as_u64(),
role: api::v1::meta::RegionRole::StagingLeader.into(),
extensions: HashMap::new(),
}],
Instant::now() + Duration::from_millis(3000),
)
.await;
tokio::time::sleep(Duration::from_millis(100)).await;
assert_eq!(engine.role(region_id).unwrap(), RegionRole::Leader);
}
}

@@ -148,9 +148,9 @@ impl HeartbeatTask {
let mut follower_region_lease_count = 0;
for lease in &lease.regions {
match lease.role() {
RegionRole::Leader | RegionRole::DowngradingLeader => {
leader_region_lease_count += 1
}
RegionRole::Leader
| RegionRole::StagingLeader
| RegionRole::DowngradingLeader => leader_region_lease_count += 1,
RegionRole::Follower => follower_region_lease_count += 1,
}
}

@@ -360,6 +360,7 @@ impl RegionServer {
engine.role(region_id).map(|role| match role {
RegionRole::Follower => false,
RegionRole::Leader => true,
RegionRole::StagingLeader => true,
RegionRole::DowngradingLeader => true,
})
})

@@ -82,7 +82,9 @@ impl Tokenizer for EnglishTokenizer {
/// `ChineseTokenizer` tokenizes a Chinese text.
///
/// It uses the Jieba tokenizer to split the text into Chinese words.
/// It uses Jieba search-mode tokenization to improve recall for Chinese fulltext search.
/// Enabling HMM also helps merge some unknown fragments into larger tokens, which can reduce
/// token cardinality versus a fully fragmented output.
#[derive(Debug, Default)]
pub struct ChineseTokenizer;
@@ -91,11 +93,35 @@ impl Tokenizer for ChineseTokenizer {
if text.is_ascii() {
EnglishTokenizer {}.tokenize(text)
} else {
JIEBA.cut(text, false)
// Search-mode tokenization emits finer-grained searchable terms, while HMM helps
// merge some unknown fragments and avoid excessive token fragmentation.
let mut tokens = JIEBA
.cut_for_search(text, true)
.into_iter()
.filter(|s| is_indexable_token(s))
.collect::<Vec<_>>();
let english = EnglishTokenizer {};
tokens.extend(
english
.tokenize(text)
.into_iter()
.filter(|token| is_ascii_underscore_token(token)),
);
tokens
}
}
}
fn is_indexable_token(token: &str) -> bool {
token.chars().any(|c| c.is_alphanumeric() || c == '_')
}
fn is_ascii_underscore_token(token: &str) -> bool {
token.is_ascii() && token.chars().any(|c| c == '_')
}
/// `Analyzer` analyzes a text into a list of tokens.
///
/// It uses a `Tokenizer` to tokenize the text and optionally lowercases the tokens.
@@ -138,11 +164,26 @@ mod tests {
#[test]
fn test_english_tokenizer() {
let tokenizer = EnglishTokenizer;
let text = "Hello, world!!! This is a----++ test012_345+67890";
let text = "Hello, world!!! This is a----++ test012_345+67890 ship_ship ship__ship _ __ __IDENTIFIER__ _ship ship_";
let tokens = tokenizer.tokenize(text);
assert_eq!(
tokens,
vec!["Hello", "world", "This", "is", "a", "test012_345", "67890"]
vec![
"Hello",
"world",
"This",
"is",
"a",
"test012_345",
"67890",
"ship_ship",
"ship__ship",
"_",
"__",
"__IDENTIFIER__",
"_ship",
"ship_"
]
);
}
@@ -167,6 +208,331 @@ mod tests {
assert_eq!(tokens, vec!["我", "喜欢", "苹果"]);
}
#[test]
fn test_chinese_tokenizer_issue_7943_sample() {
let tokenizer = ChineseTokenizer;
let text = "[2026/04/09/ 13:56:11.031]2026-04-09 13:56:11.031 - [ trace_id=340a6a44b0bd8e37bb7697ss7da61ff0 span_id=085ff5ttf1e0a23b trace_flags=01] - [http-nio-8081-exec-16] INFO c.h.p.xx.web.service.impl.CCCXForwardKKKServiceImpl.pushout(188) - 登录手机号18888888888的动态key829889AC8 ship_ship ship__ship _ __ __IDENTIFIER__ _ship ship_ EOF";
let tokens = tokenizer.tokenize(text);
assert_eq!(
tokens,
vec![
"2026",
"04",
"09",
"13",
"56",
"11.031",
"2026-04",
"09",
"13",
"56",
"11.031",
"trace",
"_",
"id",
"340a6a44b0bd8e37bb7697ss7da61ff0",
"span",
"_",
"id",
"085ff5ttf1e0a23b",
"trace",
"_",
"flags",
"01",
"http",
"nio-8081",
"exec-16",
"INFO",
"c",
"h",
"p",
"xx",
"web",
"service",
"impl",
"CCCXForwardKKKServiceImpl",
"pushout",
"188",
"登录",
"手机",
"手机号",
"18888888888",
"的",
"动态",
"key",
"829889AC8",
"ship",
"_",
"ship",
"ship",
"__",
"ship",
"_",
"__",
"__",
"IDENTIFIER",
"__",
"_",
"ship",
"ship",
"_",
"EOF",
"trace_id",
"span_id",
"trace_flags",
"ship_ship",
"ship__ship",
"_",
"__",
"__IDENTIFIER__",
"_ship",
"ship_"
]
);
}
#[test]
fn test_chinese_tokenizer_keeps_ascii_underscore_compounds() {
let tokenizer = ChineseTokenizer;
let text = "trace_id=abc 登录手机号 dynamic_key=xyz";
let tokens = tokenizer.tokenize(text);
assert!(tokens.contains(&"trace_id"));
assert!(tokens.contains(&"dynamic_key"));
assert!(tokens.contains(&"登录"));
assert!(tokens.contains(&"手机号"));
}
#[test]
fn test_chinese_tokenizer_skips_non_ascii_underscore_tokens() {
let tokenizer = ChineseTokenizer;
let text = "登录_id trace_id 手机号_trace";
let tokens = tokenizer.tokenize(text);
assert_eq!(
tokens,
[
"登录",
"_",
"id",
"trace",
"_",
"id",
"手机",
"手机号",
"_",
"trace",
"trace_id"
]
);
}
#[test]
fn test_chinese_tokenizer_aggressive_tokenization_probe() {
let tokenizer = ChineseTokenizer;
let text = "哈基米哦南北绿豆,噢马自立曼波。登录手机号。中国农业银行。装电视台,中国中央广播电视台。压不缩,笑不活。";
let default_tokens = tokenizer.tokenize(text);
let cut_hmm_false = JIEBA.cut(text, false);
let cut_hmm_true = JIEBA.cut(text, true);
let cut_for_search_hmm_false = JIEBA.cut_for_search(text, false);
let cut_for_search_hmm_true = JIEBA.cut_for_search(text, true);
assert_eq!(
default_tokens,
[
"哈基米",
"哦",
"南北",
"绿豆",
"噢",
"马",
"自立",
"曼波",
"登录",
"手机",
"手机号",
"中国",
"农业",
"银行",
"中国农业银行",
"装",
"电视",
"电视台",
"中国",
"中央",
"广播",
"电视",
"电视台",
"不缩",
"压不缩",
"笑",
"不活",
]
);
assert_eq!(
cut_hmm_false,
[
"哈",
"基",
"米",
"哦",
"南北",
"绿豆",
",",
"噢",
"马",
"自立",
"曼",
"波",
"。",
"登录",
"手机号",
"。",
"中国农业银行",
"。",
"装",
"电视台",
",",
"中国",
"中央",
"广播",
"电视台",
"。",
"压",
"不",
"缩",
",",
"笑",
"不",
"活",
"。"
]
);
assert_eq!(
cut_hmm_true,
[
"哈基米",
"哦",
"南北",
"绿豆",
",",
"噢",
"马",
"自立",
"曼波",
"。",
"登录",
"手机号",
"。",
"中国农业银行",
"。",
"装",
"电视台",
",",
"中国",
"中央",
"广播",
"电视台",
"。",
"压不缩",
",",
"笑",
"不活",
"。"
]
);
assert_eq!(
cut_for_search_hmm_false,
[
"哈",
"基",
"米",
"哦",
"南北",
"绿豆",
",",
"噢",
"马",
"自立",
"曼",
"波",
"。",
"登录",
"手机",
"手机号",
"。",
"中国",
"农业",
"银行",
"中国农业银行",
"。",
"装",
"电视",
"电视台",
",",
"中国",
"中央",
"广播",
"电视",
"电视台",
"。",
"压",
"不",
"缩",
",",
"笑",
"不",
"活",
"。"
]
);
assert_eq!(
cut_for_search_hmm_true,
[
"哈基米",
"哦",
"南北",
"绿豆",
",",
"噢",
"马",
"自立",
"曼波",
"。",
"登录",
"手机",
"手机号",
"。",
"中国",
"农业",
"银行",
"中国农业银行",
"。",
"装",
"电视",
"电视台",
",",
"中国",
"中央",
"广播",
"电视",
"电视台",
"。",
"不缩",
"压不缩",
",",
"笑",
"不活",
"。"
]
);
}
#[test]
fn test_valid_ascii_token_lookup_table() {
// Test all ASCII values in a single loop
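The `ChineseTokenizer` change filters the segmenter output and then appends ASCII tokens that contain an underscore, so compounds such as `trace_id` stay searchable as a whole. The sketch below isolates that filter-and-merge step without depending on Jieba. `merge_tokens` and its two slice parameters are hypothetical; they stand in for the output of `cut_for_search` and of `EnglishTokenizer::tokenize`.

```rust
/// Keeps tokens that contain at least one alphanumeric character or underscore.
fn is_indexable_token(token: &str) -> bool {
    token.chars().any(|c| c.is_alphanumeric() || c == '_')
}

/// Keeps pure-ASCII tokens that contain an underscore, e.g. `trace_id`.
fn is_ascii_underscore_token(token: &str) -> bool {
    token.is_ascii() && token.contains('_')
}

/// Hypothetical helper: filters segmenter output, then appends ASCII underscore
/// compounds taken from a second, English-style tokenization of the same text.
pub fn merge_tokens<'a>(segmented: &[&'a str], ascii_tokens: &[&'a str]) -> Vec<&'a str> {
    let mut tokens: Vec<&str> = segmented
        .iter()
        .copied()
        .filter(|t| is_indexable_token(t))
        .collect();
    tokens.extend(
        ascii_tokens
            .iter()
            .copied()
            .filter(|t| is_ascii_underscore_token(t)),
    );
    tokens
}
```

Appending the compounds at the end, rather than replacing the fragments, keeps both `trace` and `trace_id` as index terms, at the cost of some duplicated tokens.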

@@ -94,15 +94,15 @@ impl IndexApplier for PredicatesIndexApplier {
.collect::<Vec<_>>();
let mut mapper = ParallelFstValuesMapper::new(reader);
let mut bm_vec = mapper.map_values_vec(&value_and_meta_vec, metrics).await?;
let bm_vec = mapper.map_values_vec(&value_and_meta_vec, metrics).await?;
let mut bitmap = bm_vec.pop().unwrap(); // SAFETY: `fst_ranges` is not empty
for bm in bm_vec {
if bm.count_ones() == 0 {
let mut iter = bm_vec.into_iter();
let mut bitmap = iter.next().unwrap(); // SAFETY: `fst_ranges` is not empty
for bm in iter {
bitmap.intersect(bm);
if bitmap.count_ones() == 0 {
break;
}
bitmap.intersect(bm);
}
output.matched_segment_ids = bitmap;
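The reordering above matters because breaking on an empty incoming bitmap before intersecting leaves the accumulator unchanged instead of zeroing it. The sketch below contrasts the two orderings using a `u64` as a stand-in bitmap. Both function names are hypothetical and the real code operates on the index crate's bitmap type, not on integers.

```rust
/// Intersects all bitmaps, stopping early once the accumulator is empty.
pub fn intersect_all(bitmaps: Vec<u64>) -> Option<u64> {
    let mut iter = bitmaps.into_iter();
    let mut acc = iter.next()?;
    for bm in iter {
        // Intersect first so an empty `bm` zeroes the accumulator...
        acc &= bm;
        // ...then stop, since further intersections cannot add bits back.
        if acc == 0 {
            break;
        }
    }
    Some(acc)
}

/// The earlier ordering, kept only for contrast: it breaks on an empty `bm`
/// before intersecting, so the accumulator keeps its stale bits.
pub fn intersect_all_check_first(mut bitmaps: Vec<u64>) -> Option<u64> {
    let mut acc = bitmaps.pop()?;
    for bm in bitmaps {
        if bm.count_ones() == 0 {
            break;
        }
        acc &= bm;
    }
    Some(acc)
}
```

For the input `[0b0000, 0b1010]` the check-first variant returns `0b1010` even though one predicate matched nothing, while the intersect-first variant returns an empty result.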

@@ -129,7 +129,7 @@ impl HeartbeatHandler for CollectDatanodeClusterInfoHandler {
let leader_regions = stat
.region_stats
.iter()
.filter(|s| s.role == RegionRole::Leader)
.filter(|s| matches!(s.role, RegionRole::Leader | RegionRole::StagingLeader))
.count();
let follower_regions = stat.region_stats.len() - leader_regions;

@@ -40,7 +40,7 @@ impl HeartbeatHandler for CollectLeaderRegionHandler {
let mut key_values = Vec::with_capacity(current_stat.region_stats.len());
for stat in current_stat.region_stats.iter() {
if stat.role != RegionRole::Leader {
if !matches!(stat.role, RegionRole::Leader | RegionRole::StagingLeader) {
continue;
}

@@ -121,7 +121,10 @@ fn to_persisted_if_leader(
datanode_id: DatanodeId,
timestamp_millis: i64,
) -> Option<(Row, PersistedRegionStat)> {
if matches!(region_stat.role, RegionRole::Leader) {
if matches!(
region_stat.role,
RegionRole::Leader | RegionRole::StagingLeader
) {
let persisted_region_stat = last_persisted_region_stats.get(&region_stat.id).map(|s| *s);
Some((
compute_persist_region_stat(

@@ -281,7 +281,7 @@ mod test {
let opening_region_id = RegionId::new(table_id, region_number + 2);
let _guard = opening_region_keeper
.register(follower_peer.id, opening_region_id)
.register_with_role(follower_peer.id, opening_region_id, RegionRole::Follower)
.unwrap();
let acc = &mut HeartbeatAccumulator::default();
@@ -398,6 +398,65 @@ mod test {
assert_eq!(acc.inactive_region_ids, HashSet::from([no_exist_region_id]));
}
#[tokio::test]
async fn test_handle_staging_leader() {
let datanode_id = 1;
let region_number = 1u32;
let table_id = 10;
let region_id = RegionId::new(table_id, region_number);
let peer = Peer::empty(datanode_id);
let table_info = new_test_table_info(table_id);
let region_routes = vec![RegionRoute {
region: Region::new_test(region_id),
leader_peer: Some(peer.clone()),
leader_state: Some(LeaderState::Staging),
..Default::default()
}];
let keeper = new_test_keeper();
let table_metadata_manager = keeper.table_metadata_manager();
table_metadata_manager
.create_table_metadata(
table_info,
TableRouteValue::physical(region_routes),
HashMap::default(),
)
.await
.unwrap();
let builder = MetasrvBuilder::new();
let metasrv = builder.build().await.unwrap();
let ctx = &mut metasrv.new_ctx();
let req = HeartbeatRequest {
duration_since_epoch: 1234,
..Default::default()
};
let acc = &mut HeartbeatAccumulator::default();
acc.stat = Some(Stat {
id: peer.id,
region_stats: vec![new_empty_region_stat(region_id, RegionRole::StagingLeader)],
..Default::default()
});
let handler = RegionLeaseHandler::new(
default_distributed_time_constants().region_lease.as_secs(),
table_metadata_manager.clone(),
Default::default(),
None,
);
handler.handle(&req, ctx, acc).await.unwrap();
assert_region_lease(
acc,
vec![GrantedRegion::new(region_id, RegionRole::StagingLeader)],
);
}
fn assert_region_lease(acc: &HeartbeatAccumulator, expected: Vec<GrantedRegion>) {
let region_lease = acc.region_lease.as_ref().unwrap().clone();
let granted: Vec<GrantedRegion> = region_lease

@@ -25,6 +25,7 @@ use common_telemetry::info;
use common_telemetry::tracing_context::TracingContext;
use serde::{Deserialize, Serialize};
use snafu::{OptionExt, ResultExt};
use store_api::region_engine::RegionRole;
use tokio::time::Instant;
use crate::error::{self, Result};
@@ -129,7 +130,7 @@ impl OpenCandidateRegion {
// Registers the opening region.
let guard = ctx
.opening_region_keeper
.register(candidate.id, *region_id)
.register_with_role(candidate.id, *region_id, RegionRole::Follower)
.context(error::RegionOperatingRaceSnafu {
peer_id: candidate.id,
region_id: *region_id,
@@ -296,7 +297,7 @@ mod tests {
let mut ctx = env.context_factory().new_context(persistent_context);
let opening_region_keeper = env.opening_region_keeper();
let _guard = opening_region_keeper
.register(to_peer_id, region_id)
.register_with_role(to_peer_id, region_id, RegionRole::Follower)
.unwrap();
let open_instruction = new_mock_open_instruction(to_peer_id, region_id);

@@ -231,6 +231,7 @@ mod tests {
use common_meta::region_keeper::MemoryRegionKeeper;
use common_meta::rpc::router::{LeaderState, Region, RegionRoute};
use common_time::util::current_time_millis;
use store_api::region_engine::RegionRole;
use store_api::storage::RegionId;
use crate::error::Error;
@@ -467,7 +468,7 @@ mod tests {
}];
let guard = opening_keeper
.register(2, RegionId::new(table_id, 1))
.register_with_role(2, RegionId::new(table_id, 1), RegionRole::Follower)
.unwrap();
ctx.volatile_ctx.opening_region_guards.push(guard);

@@ -42,7 +42,7 @@ use common_meta::lock_key::{CatalogLock, SchemaLock, TableLock, TableNameLock};
use common_meta::node_manager::NodeManagerRef;
use common_meta::region_keeper::{MemoryRegionKeeperRef, OperatingRegionGuard};
use common_meta::region_registry::LeaderRegionRegistryRef;
use common_meta::rpc::router::{RegionRoute, operating_leader_regions};
use common_meta::rpc::router::{RegionRoute, operating_leader_region_roles};
use common_procedure::error::{FromJsonSnafu, ToJsonSnafu};
use common_procedure::{
BoxedProcedure, Context as ProcedureContext, Error as ProcedureError, LockKey, Procedure,
@@ -59,11 +59,13 @@ use crate::error::{self, Result};
use crate::procedure::repartition::collect::ProcedureMeta;
use crate::procedure::repartition::deallocate_region::DeallocateRegion;
use crate::procedure::repartition::group::{
Context as RepartitionGroupContext, RepartitionGroupProcedure,
Context as RepartitionGroupContext, RepartitionGroupProcedure, region_routes,
};
use crate::procedure::repartition::plan::RepartitionPlanEntry;
use crate::procedure::repartition::repartition_start::RepartitionStart;
use crate::procedure::repartition::utils::get_datanode_table_value;
use crate::procedure::repartition::utils::{
get_datanode_table_value, rollback_group_metadata_routes,
};
use crate::service::mailbox::MailboxRef;
#[cfg(test)]
@@ -76,11 +78,17 @@ pub struct PersistentContext {
pub table_name: String,
pub table_id: TableId,
pub plans: Vec<RepartitionPlanEntry>,
/// Records failed sub-procedures for metadata rollback.
#[serde(default)]
/// Records failed sub-procedures for parent rollback selection.
///
/// The parent repartition procedure uses these entries to decide which plans
/// require group-metadata restoration and allocated-region cleanup.
pub failed_procedures: Vec<ProcedureMeta>,
#[serde(default)]
/// Records unknown sub-procedures for metadata rollback.
/// Records unknown sub-procedures for parent rollback selection.
///
/// Unknown procedures are treated the same as failed ones when selecting the
/// plan subset that must be rolled back by the parent procedure.
pub unknown_procedures: Vec<ProcedureMeta>,
/// The timeout for repartition operations.
#[serde(with = "humantime_serde", default = "default_timeout")]
@@ -409,9 +417,9 @@ impl Context {
region_routes: &[RegionRoute],
) -> Result<Vec<OperatingRegionGuard>> {
let mut operating_guards = Vec::with_capacity(region_routes.len());
for (region_id, datanode_id) in operating_leader_regions(region_routes) {
for (region_id, datanode_id, role) in operating_leader_region_roles(region_routes) {
let guard = memory_region_keeper
.register(datanode_id, region_id)
.register_with_role(datanode_id, region_id, role)
.context(error::RegionOperatingRaceSnafu {
peer_id: datanode_id,
region_id,
@@ -506,6 +514,23 @@ impl RepartitionProcedure {
|| self.state.as_any().is::<collect::Collect>()
}
fn rollback_plan_indices(&self) -> HashSet<usize> {
self.context
.persistent_ctx
.failed_procedures
.iter()
.chain(self.context.persistent_ctx.unknown_procedures.iter())
.map(|procedure_meta| procedure_meta.plan_index)
.collect()
}
/// Returns allocated region ids that parent rollback should remove.
///
/// Rollback uses an "after AllocateRegion" semantic:
/// - in `AllocateRegion` and `Dispatch`, all allocated regions belong to the
/// current repartition attempt and must be cleaned up.
/// - in `Collect`, only the plans referenced by failed or unknown
/// sub-procedures should be rolled back.
fn rollback_allocated_region_ids(&self) -> HashSet<store_api::storage::RegionId> {
if self.state.as_any().is::<allocate_region::AllocateRegion>()
|| self.state.as_any().is::<dispatch::Dispatch>()
@@ -519,13 +544,9 @@ impl RepartitionProcedure {
.collect();
}
self.context
.persistent_ctx
.failed_procedures
.iter()
.chain(self.context.persistent_ctx.unknown_procedures.iter())
.flat_map(|procedure_meta| {
let plan_index = procedure_meta.plan_index;
self.rollback_plan_indices()
.into_iter()
.flat_map(|plan_index| {
self.context.persistent_ctx.plans[plan_index]
.allocated_region_ids
.iter()
@@ -534,15 +555,35 @@ impl RepartitionProcedure {
.collect()
}
fn filter_allocated_region_routes(
region_routes: &[RegionRoute],
allocated_region_ids: &HashSet<store_api::storage::RegionId>,
) -> Vec<RegionRoute> {
region_routes
.iter()
.filter(|route| !allocated_region_ids.contains(&route.region.id))
.cloned()
.collect()
/// Restores group-level staging metadata for failed/unknown plans.
///
/// The helper mutates `region_routes` in memory.
async fn rollback_group_metadata_for_selected_plans(
&mut self,
region_routes: &mut [RegionRoute],
) -> Result<()> {
let rollback_plan_indices = self.rollback_plan_indices();
if rollback_plan_indices.is_empty() {
return Ok(());
}
let mut region_routes_map = region_routes
.iter_mut()
.map(|route| (route.region.id, route))
.collect::<HashMap<_, _>>();
for plan_index in rollback_plan_indices {
let plan = &self.context.persistent_ctx.plans[plan_index];
rollback_group_metadata_routes(
plan.group_id,
&plan.source_regions,
&plan.original_target_routes,
&plan.allocated_region_ids,
&plan.pending_deallocate_region_ids,
&mut region_routes_map,
)?;
}
Ok(())
}
async fn rollback_inner(&mut self, procedure_ctx: &ProcedureContext) -> Result<()> {
@@ -552,17 +593,17 @@ impl RepartitionProcedure {
let table_id = self.context.persistent_ctx.table_id;
let allocated_region_ids = self.rollback_allocated_region_ids();
if allocated_region_ids.is_empty() {
return Ok(());
}
let table_lock = TableLock::Write(table_id).into();
let _guard = procedure_ctx.provider.acquire_lock(&table_lock).await;
let table_route_value = self.context.get_table_route_value().await?;
let current_region_routes = table_route_value.region_routes().unwrap();
let original_region_routes = region_routes(table_id, table_route_value.get_inner_ref())?;
let mut current_region_routes = original_region_routes.clone();
self.rollback_group_metadata_for_selected_plans(&mut current_region_routes)
.await?;
let allocated_region_routes = DeallocateRegion::filter_deallocatable_region_routes(
table_id,
current_region_routes,
&current_region_routes,
&allocated_region_ids,
);
if !allocated_region_routes.is_empty() {
@@ -587,9 +628,9 @@ impl RepartitionProcedure {
}
let new_region_routes =
Self::filter_allocated_region_routes(current_region_routes, &allocated_region_ids);
DeallocateRegion::generate_region_routes(&current_region_routes, &allocated_region_ids);
if new_region_routes.len() != current_region_routes.len() {
if new_region_routes != *original_region_routes {
self.context
.update_table_route(&table_route_value, new_region_routes, HashMap::new())
.await
@@ -796,9 +837,11 @@ mod tests {
};
use common_meta::error;
use common_meta::peer::Peer;
use common_meta::rpc::router::{Region, RegionRoute};
use common_meta::region_keeper::MemoryRegionKeeper;
use common_meta::rpc::router::{LeaderState, Region, RegionRoute};
use common_meta::test_util::MockDatanodeManager;
use common_procedure::{Error as ProcedureError, Procedure, ProcedureId, ProcedureState};
use store_api::region_engine::RegionRole;
use store_api::storage::RegionId;
use table::table_name::TableName;
use tokio::sync::mpsc;
@@ -809,6 +852,7 @@ mod tests {
use crate::procedure::repartition::collect::Collect;
use crate::procedure::repartition::deallocate_region::DeallocateRegion;
use crate::procedure::repartition::dispatch::Dispatch;
use crate::procedure::repartition::group::update_metadata::UpdateMetadata;
use crate::procedure::repartition::plan::RegionDescriptor;
use crate::procedure::repartition::repartition_end::RepartitionEnd;
use crate::procedure::repartition::test_util::{
@@ -837,9 +881,52 @@ mod tests {
allocated_region_ids: vec![RegionId::new(table_id, 3)],
pending_deallocate_region_ids: vec![],
transition_map: vec![vec![0, 1]],
original_target_routes: vec![],
}
}
fn with_rollback_metadata(
mut plan: RepartitionPlanEntry,
original_target_routes: Vec<RegionRoute>,
) -> RepartitionPlanEntry {
plan.original_target_routes = original_target_routes;
plan
}
fn apply_group_staging(
plan: &RepartitionPlanEntry,
current_region_routes: &[RegionRoute],
) -> Vec<RegionRoute> {
UpdateMetadata::apply_staging_region_routes(
plan.group_id,
&plan.source_regions,
&plan.target_regions,
&plan.pending_deallocate_region_ids,
current_region_routes,
)
.unwrap()
}
fn exit_group_staging(
plan: &RepartitionPlanEntry,
current_region_routes: &[RegionRoute],
) -> Vec<RegionRoute> {
UpdateMetadata::exit_staging_region_routes(
plan.group_id,
&plan.source_regions,
&plan.target_regions,
current_region_routes,
)
.unwrap()
}
fn region_route_by_id(region_routes: &[RegionRoute], region_id: RegionId) -> &RegionRoute {
region_routes
.iter()
.find(|route| route.region.id == region_id)
.unwrap()
}
fn test_procedure(state: Box<dyn State>, context: Context) -> RepartitionProcedure {
RepartitionProcedure { state, context }
}
@@ -870,10 +957,8 @@ mod tests {
];
let allocated_region_ids = HashSet::from([RegionId::new(table_id, 2)]);
let new_region_routes = RepartitionProcedure::filter_allocated_region_routes(
&region_routes,
&allocated_region_ids,
);
let new_region_routes =
DeallocateRegion::generate_region_routes(&region_routes, &allocated_region_ids);
assert_eq!(new_region_routes.len(), 1);
assert_eq!(new_region_routes[0].region.id, RegionId::new(table_id, 1));
@@ -910,6 +995,59 @@ mod tests {
assert!(!procedure.should_rollback_allocated_regions());
}
#[test]
fn test_register_operating_regions_preserves_route_roles() {
let keeper = Arc::new(MemoryRegionKeeper::new());
let region_routes = vec![
RegionRoute {
region: Region::new_test(RegionId::new(1024, 1)),
leader_peer: Some(Peer::empty(1)),
follower_peers: vec![],
leader_state: None,
leader_down_since: None,
write_route_policy: None,
},
RegionRoute {
region: Region::new_test(RegionId::new(1024, 2)),
leader_peer: Some(Peer::empty(2)),
follower_peers: vec![],
leader_state: Some(LeaderState::Staging),
leader_down_since: None,
write_route_policy: None,
},
RegionRoute {
region: Region::new_test(RegionId::new(1024, 3)),
leader_peer: Some(Peer::empty(3)),
follower_peers: vec![],
leader_state: Some(LeaderState::Downgrading),
leader_down_since: None,
write_route_policy: None,
},
];
let _guards = Context::register_operating_regions(&keeper, &region_routes).unwrap();
let leader_roles =
keeper.extract_operating_region_roles(1, &HashSet::from([RegionId::new(1024, 1)]));
let staging_roles =
keeper.extract_operating_region_roles(2, &HashSet::from([RegionId::new(1024, 2)]));
let downgrading_roles =
keeper.extract_operating_region_roles(3, &HashSet::from([RegionId::new(1024, 3)]));
assert_eq!(
leader_roles.get(&RegionId::new(1024, 1)),
Some(&RegionRole::Leader)
);
assert_eq!(
staging_roles.get(&RegionId::new(1024, 2)),
Some(&RegionRole::StagingLeader)
);
assert_eq!(
downgrading_roles.get(&RegionId::new(1024, 3)),
Some(&RegionRole::DowngradingLeader)
);
}
#[tokio::test]
async fn test_repartition_rollback_removes_allocated_routes_from_dispatch() {
let env = TestingEnv::new();
@@ -939,7 +1077,16 @@ mod tests {
table_id,
None,
);
persistent_ctx.plans = vec![test_plan(table_id)];
persistent_ctx.plans = vec![with_rollback_metadata(
test_plan(table_id),
vec![
test_region_route(
RegionId::new(table_id, 1),
&range_expr("x", 0, 100).as_json_str().unwrap(),
),
test_region_route(RegionId::new(table_id, 3), ""),
],
)];
persistent_ctx.failed_procedures = vec![ProcedureMeta {
plan_index: 0,
group_id: Uuid::new_v4(),
@@ -1050,6 +1197,16 @@ mod tests {
None,
);
let failed_plan = test_plan(table_id);
let failed_plan = with_rollback_metadata(
failed_plan,
vec![
test_region_route(
RegionId::new(table_id, 1),
&range_expr("x", 0, 100).as_json_str().unwrap(),
),
test_region_route(RegionId::new(table_id, 3), ""),
],
);
let succeeded_plan = RepartitionPlanEntry {
group_id: Uuid::new_v4(),
source_regions: vec![RegionDescriptor {
@@ -1069,6 +1226,13 @@ mod tests {
allocated_region_ids: vec![RegionId::new(table_id, 4)],
pending_deallocate_region_ids: vec![],
transition_map: vec![vec![0]],
original_target_routes: vec![
test_region_route(
RegionId::new(table_id, 2),
&range_expr("x", 100, 200).as_json_str().unwrap(),
),
test_region_route(RegionId::new(table_id, 4), ""),
],
};
persistent_ctx.plans = vec![failed_plan, succeeded_plan];
persistent_ctx.failed_procedures = vec![ProcedureMeta {
@@ -1100,6 +1264,212 @@ mod tests {
assert_eq!(region_routes[2].region.id, RegionId::new(table_id, 4));
}
#[tokio::test]
async fn test_repartition_rollback_from_collect_restores_failed_group_metadata_only() {
let env = TestingEnv::new();
let table_id = 1024;
let node_manager = Arc::new(MockDatanodeManager::new(UnexpectedErrorDatanodeHandler));
let ddl_ctx = env.ddl_context(node_manager);
let original_region_routes = vec![
test_region_route(
RegionId::new(table_id, 1),
&range_expr("x", 0, 100).as_json_str().unwrap(),
),
test_region_route(
RegionId::new(table_id, 2),
&range_expr("x", 100, 200).as_json_str().unwrap(),
),
test_region_route(RegionId::new(table_id, 3), ""),
test_region_route(RegionId::new(table_id, 4), ""),
];
let failed_plan = with_rollback_metadata(
test_plan(table_id),
vec![
original_region_routes[0].clone(),
original_region_routes[2].clone(),
],
);
let succeeded_plan = RepartitionPlanEntry {
group_id: Uuid::new_v4(),
source_regions: vec![RegionDescriptor {
region_id: RegionId::new(table_id, 2),
partition_expr: range_expr("x", 100, 200),
}],
target_regions: vec![
RegionDescriptor {
region_id: RegionId::new(table_id, 2),
partition_expr: range_expr("x", 100, 150),
},
RegionDescriptor {
region_id: RegionId::new(table_id, 4),
partition_expr: range_expr("x", 150, 200),
},
],
allocated_region_ids: vec![RegionId::new(table_id, 4)],
pending_deallocate_region_ids: vec![],
transition_map: vec![vec![0, 1]],
original_target_routes: vec![
original_region_routes[1].clone(),
original_region_routes[3].clone(),
],
};
let current_region_routes = apply_group_staging(&failed_plan, &original_region_routes);
let current_region_routes = apply_group_staging(&succeeded_plan, &current_region_routes);
let current_region_routes = exit_group_staging(&succeeded_plan, &current_region_routes);
env.create_physical_table_metadata_with_wal_options(
table_id,
current_region_routes,
test_region_wal_options(&[1, 2, 3, 4]),
)
.await;
let mut persistent_ctx = PersistentContext::new(
TableName::new("test_catalog", "test_schema", "test_table"),
table_id,
None,
);
persistent_ctx.plans = vec![failed_plan, succeeded_plan.clone()];
persistent_ctx.failed_procedures = vec![ProcedureMeta {
plan_index: 0,
group_id: persistent_ctx.plans[0].group_id,
procedure_id: ProcedureId::random(),
}];
let context = Context::new(
&ddl_ctx,
env.mailbox_ctx.mailbox().clone(),
env.server_addr.clone(),
persistent_ctx,
);
let mut procedure = RepartitionProcedure {
state: Box::new(Collect::new(vec![])),
context,
};
procedure
.rollback(&TestingEnv::procedure_context())
.await
.unwrap();
assert_eq!(
current_parent_region_routes(&procedure.context).await,
vec![
test_region_route(
RegionId::new(table_id, 1),
&range_expr("x", 0, 100).as_json_str().unwrap(),
),
RegionRoute {
region: Region {
id: RegionId::new(table_id, 2),
partition_expr: range_expr("x", 100, 150).as_json_str().unwrap(),
..Default::default()
},
leader_peer: Some(Peer::empty(1)),
..Default::default()
},
RegionRoute {
region: Region {
id: RegionId::new(table_id, 4),
partition_expr: range_expr("x", 150, 200).as_json_str().unwrap(),
..Default::default()
},
leader_peer: Some(Peer::empty(1)),
..Default::default()
},
]
);
}
#[tokio::test]
async fn test_repartition_rollback_from_collect_restores_unknown_group_metadata() {
let env = TestingEnv::new();
let table_id = 1024;
let node_manager = Arc::new(MockDatanodeManager::new(UnexpectedErrorDatanodeHandler));
let ddl_ctx = env.ddl_context(node_manager);
let original_region_routes = vec![
test_region_route(
RegionId::new(table_id, 1),
&range_expr("x", 0, 100).as_json_str().unwrap(),
),
test_region_route(
RegionId::new(table_id, 2),
&range_expr("x", 100, 200).as_json_str().unwrap(),
),
test_region_route(RegionId::new(table_id, 3), ""),
];
let plan = with_rollback_metadata(
test_plan(table_id),
vec![
original_region_routes[0].clone(),
original_region_routes[2].clone(),
],
);
let staged_region_routes = apply_group_staging(&plan, &original_region_routes);
assert_eq!(
region_route_by_id(&staged_region_routes, RegionId::new(table_id, 1))
.region
.partition_expr(),
range_expr("x", 0, 50).as_json_str().unwrap()
);
assert!(
region_route_by_id(&staged_region_routes, RegionId::new(table_id, 1))
.is_leader_staging()
);
assert_eq!(
region_route_by_id(&staged_region_routes, RegionId::new(table_id, 3))
.region
.partition_expr(),
range_expr("x", 50, 100).as_json_str().unwrap()
);
assert!(
region_route_by_id(&staged_region_routes, RegionId::new(table_id, 3))
.is_leader_staging()
);
env.create_physical_table_metadata_with_wal_options(
table_id,
staged_region_routes,
test_region_wal_options(&[1, 2, 3]),
)
.await;
let mut persistent_ctx = PersistentContext::new(
TableName::new("test_catalog", "test_schema", "test_table"),
table_id,
None,
);
persistent_ctx.plans = vec![plan.clone()];
persistent_ctx.unknown_procedures = vec![ProcedureMeta {
plan_index: 0,
group_id: plan.group_id,
procedure_id: ProcedureId::random(),
}];
let context = Context::new(
&ddl_ctx,
env.mailbox_ctx.mailbox().clone(),
env.server_addr.clone(),
persistent_ctx,
);
let mut procedure = RepartitionProcedure {
state: Box::new(Collect::new(vec![])),
context,
};
procedure
.rollback(&TestingEnv::procedure_context())
.await
.unwrap();
assert_eq!(
current_parent_region_routes(&procedure.context).await,
vec![
original_region_routes[0].clone(),
original_region_routes[1].clone()
]
);
}
#[tokio::test]
async fn test_repartition_rollback_is_idempotent() {
let env = TestingEnv::new();
@@ -1129,7 +1499,16 @@ mod tests {
table_id,
None,
);
persistent_ctx.plans = vec![test_plan(table_id)];
persistent_ctx.plans = vec![with_rollback_metadata(
test_plan(table_id),
vec![
test_region_route(
RegionId::new(table_id, 1),
&range_expr("x", 0, 100).as_json_str().unwrap(),
),
test_region_route(RegionId::new(table_id, 3), ""),
],
)];
persistent_ctx.failed_procedures = vec![ProcedureMeta {
plan_index: 0,
group_id: Uuid::new_v4(),
@@ -1164,6 +1543,147 @@ mod tests {
assert_eq!(once[1].region.id, RegionId::new(table_id, 2));
}
#[tokio::test]
async fn test_repartition_rollback_from_collect_restores_failed_merge_group_metadata_only() {
let env = TestingEnv::new();
let table_id = 1024;
let node_manager = Arc::new(MockDatanodeManager::new(UnexpectedErrorDatanodeHandler));
let ddl_ctx = env.ddl_context(node_manager);
let original_region_routes = vec![
test_region_route(
RegionId::new(table_id, 1),
&range_expr("x", 0, 100).as_json_str().unwrap(),
),
test_region_route(
RegionId::new(table_id, 2),
&range_expr("x", 100, 200).as_json_str().unwrap(),
),
test_region_route(
RegionId::new(table_id, 3),
&range_expr("x", 200, 300).as_json_str().unwrap(),
),
test_region_route(RegionId::new(table_id, 4), ""),
];
let failed_merge_plan = RepartitionPlanEntry {
group_id: Uuid::new_v4(),
source_regions: vec![
RegionDescriptor {
region_id: RegionId::new(table_id, 1),
partition_expr: range_expr("x", 0, 100),
},
RegionDescriptor {
region_id: RegionId::new(table_id, 2),
partition_expr: range_expr("x", 100, 200),
},
],
target_regions: vec![RegionDescriptor {
region_id: RegionId::new(table_id, 1),
partition_expr: range_expr("x", 0, 200),
}],
allocated_region_ids: vec![],
pending_deallocate_region_ids: vec![RegionId::new(table_id, 2)],
transition_map: vec![vec![0], vec![0]],
original_target_routes: vec![original_region_routes[0].clone()],
};
let succeeded_split_plan = RepartitionPlanEntry {
group_id: Uuid::new_v4(),
source_regions: vec![RegionDescriptor {
region_id: RegionId::new(table_id, 3),
partition_expr: range_expr("x", 200, 300),
}],
target_regions: vec![
RegionDescriptor {
region_id: RegionId::new(table_id, 3),
partition_expr: range_expr("x", 200, 250),
},
RegionDescriptor {
region_id: RegionId::new(table_id, 4),
partition_expr: range_expr("x", 250, 300),
},
],
allocated_region_ids: vec![RegionId::new(table_id, 4)],
pending_deallocate_region_ids: vec![],
transition_map: vec![vec![0, 1]],
original_target_routes: vec![
original_region_routes[2].clone(),
original_region_routes[3].clone(),
],
};
let current_region_routes =
apply_group_staging(&failed_merge_plan, &original_region_routes);
let current_region_routes =
apply_group_staging(&succeeded_split_plan, &current_region_routes);
let staged_region_routes =
exit_group_staging(&succeeded_split_plan, &current_region_routes);
env.create_physical_table_metadata_with_wal_options(
table_id,
staged_region_routes,
test_region_wal_options(&[1, 2, 3, 4]),
)
.await;
let mut persistent_ctx = PersistentContext::new(
TableName::new("test_catalog", "test_schema", "test_table"),
table_id,
None,
);
persistent_ctx.plans = vec![failed_merge_plan, succeeded_split_plan.clone()];
persistent_ctx.failed_procedures = vec![ProcedureMeta {
plan_index: 0,
group_id: persistent_ctx.plans[0].group_id,
procedure_id: ProcedureId::random(),
}];
let context = Context::new(
&ddl_ctx,
env.mailbox_ctx.mailbox().clone(),
env.server_addr.clone(),
persistent_ctx,
);
let mut procedure = RepartitionProcedure {
state: Box::new(Collect::new(vec![])),
context,
};
procedure
.rollback(&TestingEnv::procedure_context())
.await
.unwrap();
let region_routes = current_parent_region_routes(&procedure.context).await;
assert_eq!(
region_routes,
vec![
test_region_route(
RegionId::new(table_id, 1),
&range_expr("x", 0, 100).as_json_str().unwrap(),
),
test_region_route(
RegionId::new(table_id, 2),
&range_expr("x", 100, 200).as_json_str().unwrap(),
),
RegionRoute {
region: Region {
id: RegionId::new(table_id, 3),
partition_expr: range_expr("x", 200, 250).as_json_str().unwrap(),
..Default::default()
},
leader_peer: Some(Peer::empty(1)),
..Default::default()
},
RegionRoute {
region: Region {
id: RegionId::new(table_id, 4),
partition_expr: range_expr("x", 250, 300).as_json_str().unwrap(),
..Default::default()
},
leader_peer: Some(Peer::empty(1)),
..Default::default()
},
]
);
}
#[tokio::test]
async fn test_repartition_procedure_flow_split_failed_and_full_rollback() {
let env = TestingEnv::new();
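
The rollback selection introduced in this file can be read as two steps: union the plan indices of failed and unknown sub-procedures, then gather the allocated region ids of only those plans. The following is a minimal standalone sketch of that idea, not the actual implementation. `ProcedureMeta` and `Plan` here are hypothetical, simplified stand-ins for the real types, and region ids are plain `u64` values for illustration.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified stand-in for the diff's `ProcedureMeta`.
struct ProcedureMeta {
    plan_index: usize,
}

/// Hypothetical, simplified stand-in for the diff's `RepartitionPlanEntry`.
struct Plan {
    allocated_region_ids: Vec<u64>,
}

/// Unions the plan indices of failed and unknown sub-procedures.
/// Unknown procedures are treated the same as failed ones.
fn rollback_plan_indices(failed: &[ProcedureMeta], unknown: &[ProcedureMeta]) -> HashSet<usize> {
    failed
        .iter()
        .chain(unknown.iter())
        .map(|meta| meta.plan_index)
        .collect()
}

/// Collects the allocated region ids owned by the selected plans only.
fn rollback_allocated_region_ids(plans: &[Plan], indices: &HashSet<usize>) -> HashSet<u64> {
    indices
        .iter()
        .flat_map(|&index| plans[index].allocated_region_ids.iter().copied())
        .collect()
}

fn main() {
    let plans = vec![
        Plan { allocated_region_ids: vec![3] },
        Plan { allocated_region_ids: vec![4] },
        Plan { allocated_region_ids: vec![5] },
    ];
    let failed = vec![ProcedureMeta { plan_index: 0 }];
    // Plan 0 is reported twice; the set deduplicates it.
    let unknown = vec![ProcedureMeta { plan_index: 2 }, ProcedureMeta { plan_index: 0 }];

    let indices = rollback_plan_indices(&failed, &unknown);
    let region_ids = rollback_allocated_region_ids(&plans, &indices);

    assert_eq!(indices, HashSet::from([0, 2]));
    // Plan 1 succeeded, so its allocated region 4 is left alone.
    assert_eq!(region_ids, HashSet::from([3, 5]));
}
```

Collecting into a `HashSet` means a plan reported by both lists is rolled back once, which is presumably why the diff returns a set rather than a vector.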

@@ -25,8 +25,8 @@ use common_meta::rpc::router::RegionRoute;
use common_procedure::{Context as ProcedureContext, Status};
use common_telemetry::{debug, info};
use serde::{Deserialize, Deserializer, Serialize};
use snafu::ResultExt;
use store_api::storage::{RegionNumber, TableId};
use snafu::{OptionExt, ResultExt};
use store_api::storage::{RegionId, RegionNumber, TableId};
use table::metadata::TableInfo;
use table::table_reference::TableReference;
use tokio::time::Instant;
@@ -104,7 +104,8 @@ impl BuildPlan {
table_id,
&mut next_region_number,
&self.plan_entries,
);
table_route_value.region_routes().unwrap(),
)?;
let plan_count = repartition_plan_entries.len();
let to_allocate = AllocateRegion::count_regions_to_allocate(&repartition_plan_entries);
info!(
@@ -258,25 +259,64 @@ impl AllocateRegion {
/// Converts allocation plan entries to repartition plan entries.
///
/// This method takes the allocation plan entries and converts them to repartition plan entries,
/// updating `next_region_number` for each newly allocated region.
/// This method converts allocation intents into concrete repartition plans,
/// updates `next_region_number` for newly allocated regions, and captures
/// each plan's `original_target_routes` from the current table-route view.
///
/// This also persists each plan's pre-staging target routes for rollback.
fn convert_to_repartition_plans(
table_id: TableId,
next_region_number: &mut RegionNumber,
plan_entries: &[AllocationPlanEntry],
) -> Vec<RepartitionPlanEntry> {
current_region_routes: &[RegionRoute],
) -> Result<Vec<RepartitionPlanEntry>> {
let region_routes_map = current_region_routes
.iter()
.map(|route| (route.region.id, route))
.collect::<HashMap<_, _>>();
plan_entries
.iter()
.map(|plan_entry| {
convert_allocation_plan_to_repartition_plan(
let mut plan = convert_allocation_plan_to_repartition_plan(
table_id,
next_region_number,
plan_entry,
)
);
Self::capture_plan_original_target_routes(&mut plan, &region_routes_map)?;
Ok(plan)
})
.collect()
}
fn capture_plan_original_target_routes(
plan: &mut RepartitionPlanEntry,
region_routes_map: &HashMap<RegionId, &RegionRoute>,
) -> Result<()> {
// Persist the pre-staging target-route view on the parent plan.
// Newly allocated targets are skipped because rollback deletes their
// route metadata rather than restoring an original target route.
let mut original_target_routes = Vec::with_capacity(plan.target_regions.len());
for target in &plan.target_regions {
if plan.allocated_region_ids.contains(&target.region_id) {
// This target region is to be allocated, so it doesn't exist in current routes.
continue;
}
let route = region_routes_map.get(&target.region_id).context(
error::RepartitionTargetRegionMissingSnafu {
group_id: plan.group_id,
region_id: target.region_id,
},
)?;
{
original_target_routes.push((*route).clone());
}
}
plan.original_target_routes = original_target_routes;
Ok(())
}
/// Collects all regions that need to be allocated from the repartition plan entries.
fn collect_allocate_regions(
repartition_plan_entries: &[RepartitionPlanEntry],
@@ -357,6 +397,8 @@ impl AllocateRegion {
#[cfg(test)]
mod tests {
use common_meta::peer::Peer;
use common_meta::rpc::router::{Region, RegionRoute};
use store_api::storage::RegionId;
use uuid::Uuid;
@@ -405,6 +447,20 @@ mod tests {
}
}
fn create_current_region_routes(table_id: TableId, region_numbers: &[u32]) -> Vec<RegionRoute> {
region_numbers
.iter()
.map(|region_number| RegionRoute {
region: Region {
id: RegionId::new(table_id, *region_number),
..Default::default()
},
leader_peer: Some(Peer::empty(1)),
..Default::default()
})
.collect()
}
#[test]
fn test_convert_to_repartition_plans_no_allocation() {
let table_id = 1024;
@@ -421,7 +477,9 @@ mod tests {
table_id,
&mut next_region_number,
&plan_entries,
);
&create_current_region_routes(table_id, &[1, 2]),
)
.unwrap();
assert_eq!(result.len(), 1);
assert_eq!(result[0].target_regions.len(), 2);
@@ -446,7 +504,9 @@ mod tests {
table_id,
&mut next_region_number,
&plan_entries,
);
&create_current_region_routes(table_id, &[1, 2]),
)
.unwrap();
assert_eq!(result.len(), 1);
assert_eq!(result[0].target_regions.len(), 4);
@@ -479,7 +539,9 @@ mod tests {
table_id,
&mut next_region_number,
&plan_entries,
);
&create_current_region_routes(table_id, &[1, 2, 3, 4]),
)
.unwrap();
assert_eq!(result.len(), 3);
assert_eq!(result[0].allocated_region_ids.len(), 1);
@@ -504,7 +566,9 @@ mod tests {
table_id,
&mut next_region_number,
&plan_entries,
);
&create_current_region_routes(table_id, &[1, 2, 3, 4]),
)
.unwrap();
let count = AllocateRegion::count_regions_to_allocate(&repartition_plans);
assert_eq!(count, 2);
@@ -524,7 +588,9 @@ mod tests {
table_id,
&mut next_region_number,
&plan_entries,
);
&create_current_region_routes(table_id, &[1, 2]),
)
.unwrap();
let allocate_regions = AllocateRegion::collect_allocate_regions(&repartition_plans);
assert_eq!(allocate_regions.len(), 2);
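
The capture step added in this file snapshots the current route of every target region that already exists, skips targets that are about to be allocated, and fails if an existing target has no route. Below is a minimal standalone sketch of that behaviour under simplifying assumptions: `RegionRoute` and `Plan` are hypothetical reduced types, region ids are `u64`, and a `String` error stands in for the `RepartitionTargetRegionMissingSnafu` context used in the diff.

```rust
use std::collections::HashMap;

/// Hypothetical simplification: the real code uses `store_api::storage::RegionId`.
type RegionId = u64;

/// Hypothetical, reduced stand-in for the diff's `RegionRoute`.
#[derive(Clone, Debug, PartialEq)]
struct RegionRoute {
    region_id: RegionId,
    partition_expr: String,
}

/// Hypothetical, reduced stand-in for the diff's `RepartitionPlanEntry`.
struct Plan {
    target_region_ids: Vec<RegionId>,
    allocated_region_ids: Vec<RegionId>,
    original_target_routes: Vec<RegionRoute>,
}

/// Stores the pre-staging route of every target that already exists.
fn capture_original_target_routes(
    plan: &mut Plan,
    routes: &HashMap<RegionId, &RegionRoute>,
) -> Result<(), String> {
    let mut original = Vec::with_capacity(plan.target_region_ids.len());
    for target in &plan.target_region_ids {
        // Newly allocated targets have no original route to restore.
        if plan.allocated_region_ids.contains(target) {
            continue;
        }
        let route = routes
            .get(target)
            .ok_or_else(|| format!("target region {target} missing from current routes"))?;
        original.push((*route).clone());
    }
    plan.original_target_routes = original;
    Ok(())
}

fn main() {
    let current = vec![
        RegionRoute { region_id: 1, partition_expr: "x in [0, 100)".to_string() },
        RegionRoute { region_id: 2, partition_expr: "x in [100, 200)".to_string() },
    ];
    let routes: HashMap<RegionId, &RegionRoute> =
        current.iter().map(|route| (route.region_id, route)).collect();

    // Split region 1 into regions 1 and 3; region 3 is newly allocated.
    let mut plan = Plan {
        target_region_ids: vec![1, 3],
        allocated_region_ids: vec![3],
        original_target_routes: vec![],
    };
    capture_original_target_routes(&mut plan, &routes).unwrap();
    assert_eq!(plan.original_target_routes, vec![current[0].clone()]);

    // An existing target without a route is an error, not a silent skip.
    let mut broken = Plan {
        target_region_ids: vec![9],
        allocated_region_ids: vec![],
        original_target_routes: vec![],
    };
    assert!(capture_original_target_routes(&mut broken, &routes).is_err());
}
```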

@@ -267,6 +267,7 @@ mod tests {
allocated_region_ids: vec![],
pending_deallocate_region_ids: vec![RegionId::new(table_id, 1)],
transition_map: vec![],
original_target_routes: vec![],
}];
let mut state = DeallocateRegion;
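
The parent rollback now calls `DeallocateRegion::generate_region_routes` in place of its removed private helper. Its body is not part of this diff, so the sketch below only assumes that it behaves like the removed `filter_allocated_region_routes`, which dropped every route whose region id is in the given set. `RegionRoute` is a hypothetical reduced type and `remove_allocated_routes` is an illustrative name, not a function from the codebase.

```rust
use std::collections::HashSet;

/// Hypothetical, reduced stand-in for the diff's `RegionRoute`.
#[derive(Clone, Debug, PartialEq)]
struct RegionRoute {
    region_id: u64,
}

/// Keeps every route whose region id is not in `allocated`.
/// Assumption: this mirrors the removed helper's filter, nothing more.
fn remove_allocated_routes(routes: &[RegionRoute], allocated: &HashSet<u64>) -> Vec<RegionRoute> {
    routes
        .iter()
        .filter(|route| !allocated.contains(&route.region_id))
        .cloned()
        .collect()
}

fn main() {
    let routes = vec![RegionRoute { region_id: 1 }, RegionRoute { region_id: 2 }];
    let allocated = HashSet::from([2]);

    let remaining = remove_allocated_routes(&routes, &allocated);

    // Same shape as the assertion in the parent procedure's test.
    assert_eq!(remaining.len(), 1);
    assert_eq!(remaining[0].region_id, 1);
}
```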

@@ -41,18 +41,14 @@ use common_procedure::{
Context as ProcedureContext, Error as ProcedureError, LockKey, Procedure,
Result as ProcedureResult, Status, StringKey, UserMetadata,
};
use common_telemetry::{error, info, warn};
use common_telemetry::{error, info};
use serde::{Deserialize, Serialize};
use snafu::{OptionExt, ResultExt};
use store_api::storage::{RegionId, TableId};
use uuid::Uuid;
use crate::error::{self, Result};
use crate::procedure::repartition::group::apply_staging_manifest::ApplyStagingManifest;
use crate::procedure::repartition::group::enter_staging_region::EnterStagingRegion;
use crate::procedure::repartition::group::remap_manifest::RemapManifest;
use crate::procedure::repartition::group::repartition_start::RepartitionStart;
use crate::procedure::repartition::group::update_metadata::UpdateMetadata;
use crate::procedure::repartition::plan::RegionDescriptor;
use crate::procedure::repartition::utils::get_datanode_table_value;
use crate::procedure::repartition::{self};
@@ -196,62 +192,6 @@ impl RepartitionGroupProcedure {
Ok(Self { state, context })
}
async fn rollback_inner(&mut self, procedure_ctx: &ProcedureContext) -> Result<()> {
if !self.should_rollback_metadata() {
return Ok(());
}
let table_lock =
common_meta::lock_key::TableLock::Write(self.context.persistent_ctx.table_id).into();
let _guard = procedure_ctx.provider.acquire_lock(&table_lock).await;
UpdateMetadata::RollbackStaging
.rollback_staging_regions(&mut self.context)
.await?;
if let Err(err) = self.context.invalidate_table_cache().await {
warn!(
err;
"Failed to broadcast the invalidate table cache message during repartition group rollback"
);
}
Ok(())
}
/// Returns whether group rollback should revert staging metadata.
///
/// This uses an "after metadata apply, before exit staging" semantic.
/// Once execution reaches `UpdateMetadata::ApplyStaging` or any later staging state,
/// rollback must restore table-route metadata back to the pre-apply view.
///
/// State flow:
/// `RepartitionStart -> SyncRegion -> UpdateMetadata::ApplyStaging -> EnterStagingRegion`
/// ` -> RemapManifest -> ApplyStagingManifest -> UpdateMetadata::ExitStaging -> RepartitionEnd`
/// ` ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^`
/// ` rollback staging metadata`
///
/// Notes:
/// - `RepartitionStart` / `SyncRegion`: no-op, metadata has not been staged yet.
/// - `UpdateMetadata::ApplyStaging` / `EnterStagingRegion` / `RemapManifest` /
/// `ApplyStagingManifest` / `UpdateMetadata::RollbackStaging`: rollback-active.
/// - `UpdateMetadata::ExitStaging` / `RepartitionEnd`: excluded, because metadata has
/// already moved into the post-commit exit path.
fn should_rollback_metadata(&self) -> bool {
self.state.as_any().is::<EnterStagingRegion>()
|| self.state.as_any().is::<RemapManifest>()
|| self.state.as_any().is::<ApplyStagingManifest>()
|| self
.state
.as_any()
.downcast_ref::<UpdateMetadata>()
.is_some_and(|state| {
matches!(
state,
UpdateMetadata::ApplyStaging | UpdateMetadata::RollbackStaging
)
})
}
}
#[async_trait::async_trait]
@@ -260,10 +200,10 @@ impl Procedure for RepartitionGroupProcedure {
Self::TYPE_NAME
}
async fn rollback(&mut self, ctx: &ProcedureContext) -> ProcedureResult<()> {
self.rollback_inner(ctx)
.await
.map_err(ProcedureError::external)
async fn rollback(&mut self, _ctx: &ProcedureContext) -> ProcedureResult<()> {
// The parent repartition procedure is responsible for rollback and recovery.
// Subprocedures are not recovered after metasrv restarts, so implementing rollback for them is meaningless.
Ok(())
}
#[tracing::instrument(skip_all, fields(
@@ -304,7 +244,9 @@ impl Procedure for RepartitionGroupProcedure {
}
fn rollback_supported(&self) -> bool {
true
// Parent repartition owns rollback and recovery because subprocedures are
// not relied on as durable rollback units across metasrv restarts.
false
}
fn dump(&self) -> ProcedureResult<String> {
@@ -665,149 +607,12 @@ pub(crate) trait State: Sync + Send + Debug {
mod tests {
use std::assert_matches;
use std::sync::Arc;
use std::time::Duration;
use common_meta::key::TableMetadataManager;
use common_meta::kv_backend::test_util::MockKvBackendBuilder;
use common_meta::peer::Peer;
use common_meta::rpc::router::{Region, RegionRoute};
use common_procedure::{Context as ProcedureContext, Procedure, ProcedureId};
use common_procedure_test::MockContextProvider;
use partition::expr::PartitionExpr;
use store_api::storage::RegionId;
use super::{
Context, PersistentContext, RepartitionGroupProcedure, RepartitionStart, State,
region_routes,
};
use crate::error::Error;
use crate::procedure::repartition::dispatch::build_region_mapping;
use crate::procedure::repartition::group::apply_staging_manifest::ApplyStagingManifest;
use crate::procedure::repartition::group::enter_staging_region::EnterStagingRegion;
use crate::procedure::repartition::group::remap_manifest::RemapManifest;
use crate::procedure::repartition::group::repartition_start::RepartitionStart as GroupRepartitionStart;
use crate::procedure::repartition::group::sync_region::SyncRegion;
use crate::procedure::repartition::group::update_metadata::UpdateMetadata;
use crate::procedure::repartition::plan;
use crate::procedure::repartition::repartition_start::RepartitionStart as ParentRepartitionStart;
use crate::procedure::repartition::test_util::{
TestingEnv, new_persistent_context, range_expr,
};
struct GroupRollbackFixture {
context: Context,
original_region_routes: Vec<RegionRoute>,
next_state: Option<Box<dyn State>>,
}
async fn new_group_rollback_fixture(
original_region_routes: Vec<RegionRoute>,
from_exprs: Vec<PartitionExpr>,
to_exprs: Vec<PartitionExpr>,
sync_region: bool,
) -> GroupRollbackFixture {
let env = TestingEnv::new();
let procedure_ctx = TestingEnv::procedure_context();
let table_id = 1024;
let mut next_region_number = 10;
env.create_physical_table_metadata(table_id, original_region_routes.clone())
.await;
let (_, physical_route) = env
.table_metadata_manager
.table_route_manager()
.get_physical_table_route(table_id)
.await
.unwrap();
let allocation_plans =
ParentRepartitionStart::build_plan(&physical_route, &from_exprs, &to_exprs).unwrap();
assert_eq!(allocation_plans.len(), 1);
let repartition_plan = plan::convert_allocation_plan_to_repartition_plan(
table_id,
&mut next_region_number,
&allocation_plans[0],
);
let region_mapping = build_region_mapping(
&repartition_plan.source_regions,
&repartition_plan.target_regions,
&repartition_plan.transition_map,
);
let persistent_context = PersistentContext::new(
repartition_plan.group_id,
table_id,
"test_catalog".to_string(),
"test_schema".to_string(),
repartition_plan.source_regions,
repartition_plan.target_regions,
region_mapping,
sync_region,
repartition_plan.allocated_region_ids,
repartition_plan.pending_deallocate_region_ids,
Duration::from_secs(120),
);
let mut context = env.create_context(persistent_context);
let (next_state, _) = GroupRepartitionStart
.next(&mut context, &procedure_ctx)
.await
.unwrap();
GroupRollbackFixture {
context,
original_region_routes,
next_state: Some(next_state),
}
}
async fn new_split_group_rollback_fixture(sync_region: bool) -> GroupRollbackFixture {
new_group_rollback_fixture(
vec![
new_region_route(RegionId::new(1024, 1), Some(range_expr("x", 0, 100))),
new_region_route(RegionId::new(1024, 2), Some(range_expr("x", 100, 200))),
new_region_route(RegionId::new(1024, 10), None),
],
vec![range_expr("x", 0, 100)],
vec![range_expr("x", 0, 50), range_expr("x", 50, 100)],
sync_region,
)
.await
}
async fn new_merge_group_rollback_fixture(sync_region: bool) -> GroupRollbackFixture {
new_group_rollback_fixture(
vec![
new_region_route(RegionId::new(1024, 1), Some(range_expr("x", 0, 100))),
new_region_route(RegionId::new(1024, 2), Some(range_expr("x", 100, 200))),
new_region_route(RegionId::new(1024, 3), Some(range_expr("x", 200, 300))),
],
vec![range_expr("x", 0, 100), range_expr("x", 100, 200)],
vec![range_expr("x", 0, 200)],
sync_region,
)
.await
}
async fn stage_metadata(context: &mut Context) {
UpdateMetadata::ApplyStaging
.apply_staging_regions(context)
.await
.unwrap();
}
fn new_region_route(region_id: RegionId, partition_expr: Option<PartitionExpr>) -> RegionRoute {
RegionRoute {
region: Region {
id: region_id,
partition_expr: partition_expr
.map(|expr| expr.as_json_str().unwrap())
.unwrap_or_default(),
..Default::default()
},
leader_peer: Some(Peer::empty(1)),
..Default::default()
}
}
use crate::procedure::repartition::test_util::{TestingEnv, new_persistent_context};
#[tokio::test]
async fn test_get_table_route_value_not_found_error() {
@@ -856,198 +661,4 @@ mod tests {
let err = ctx.get_datanode_table_value(1024, 1).await.unwrap_err();
assert!(err.is_retryable());
}
#[tokio::test]
async fn test_group_rollback_supported() {
let env = TestingEnv::new();
let persistent_context = new_persistent_context(1024, vec![], vec![]);
let procedure = RepartitionGroupProcedure {
state: Box::new(RepartitionStart),
context: env.create_context(persistent_context),
};
assert!(procedure.rollback_supported());
}
#[tokio::test]
async fn test_group_rollback_is_noop_before_apply_staging() {
let env = TestingEnv::new();
let persistent_context = new_persistent_context(1024, vec![], vec![]);
let ctx = env.create_context(persistent_context.clone());
let mut procedure = RepartitionGroupProcedure {
state: Box::new(RepartitionStart),
context: ctx,
};
let provider = Arc::new(MockContextProvider::new(Default::default()));
let procedure_ctx = ProcedureContext {
procedure_id: ProcedureId::random(),
provider,
};
procedure.rollback(&procedure_ctx).await.unwrap();
assert!(procedure.state.as_any().is::<RepartitionStart>());
assert_eq!(procedure.context.persistent_ctx, persistent_context);
}
async fn assert_noop_rollback(
fixture: GroupRollbackFixture,
state: Box<dyn State>,
assert_state: impl FnOnce(&dyn State),
) {
let original_region_routes = fixture.original_region_routes.clone();
let procedure_ctx = TestingEnv::procedure_context();
let mut procedure = RepartitionGroupProcedure {
state,
context: fixture.context,
};
procedure.rollback(&procedure_ctx).await.unwrap();
assert_state(&*procedure.state);
let table_route_value = procedure
.context
.get_table_route_value()
.await
.unwrap()
.into_inner();
let region_routes = region_routes(
procedure.context.persistent_ctx.table_id,
&table_route_value,
)
.unwrap();
assert_eq!(region_routes.clone(), original_region_routes);
}
async fn assert_metadata_rollback_restores_table_route(
mut fixture: GroupRollbackFixture,
state: Box<dyn State>,
) {
let original_region_routes = fixture.original_region_routes.clone();
let procedure_ctx = TestingEnv::procedure_context();
stage_metadata(&mut fixture.context).await;
let mut procedure = RepartitionGroupProcedure {
state,
context: fixture.context,
};
procedure.rollback(&procedure_ctx).await.unwrap();
let table_route_value = procedure
.context
.get_table_route_value()
.await
.unwrap()
.into_inner();
let region_routes = region_routes(
procedure.context.persistent_ctx.table_id,
&table_route_value,
)
.unwrap();
assert_eq!(region_routes.clone(), original_region_routes);
}
#[tokio::test]
async fn test_group_rollback_is_noop_in_sync_region() {
let mut fixture = new_split_group_rollback_fixture(true).await;
assert!(
fixture
.next_state
.as_ref()
.unwrap()
.as_any()
.is::<SyncRegion>()
);
let state = fixture.next_state.take().unwrap();
assert_noop_rollback(fixture, state, |state| {
assert!(state.as_any().is::<SyncRegion>());
})
.await;
}
#[tokio::test]
async fn test_group_rollback_is_noop_in_exit_staging() {
let fixture = new_split_group_rollback_fixture(false).await;
assert_noop_rollback(fixture, Box::new(UpdateMetadata::ExitStaging), |state| {
assert!(state.as_any().is::<UpdateMetadata>());
assert!(matches!(
state.as_any().downcast_ref::<UpdateMetadata>(),
Some(UpdateMetadata::ExitStaging)
));
})
.await;
}
#[tokio::test]
async fn test_group_rollback_restores_split_routes_from_apply_staging() {
let fixture = new_split_group_rollback_fixture(false).await;
assert_metadata_rollback_restores_table_route(
fixture,
Box::new(UpdateMetadata::ApplyStaging),
)
.await;
}
#[tokio::test]
async fn test_group_rollback_restores_split_routes_from_enter_staging_region() {
let fixture = new_split_group_rollback_fixture(false).await;
assert_metadata_rollback_restores_table_route(fixture, Box::new(EnterStagingRegion)).await;
}
#[tokio::test]
async fn test_group_rollback_restores_split_routes_from_remap_manifest() {
let fixture = new_split_group_rollback_fixture(false).await;
assert_metadata_rollback_restores_table_route(fixture, Box::new(RemapManifest)).await;
}
#[tokio::test]
async fn test_group_rollback_restores_split_routes_from_apply_staging_manifest() {
let fixture = new_split_group_rollback_fixture(false).await;
assert_metadata_rollback_restores_table_route(fixture, Box::new(ApplyStagingManifest))
.await;
}
#[tokio::test]
async fn test_group_rollback_restores_merge_routes_and_is_idempotent() {
let mut fixture = new_merge_group_rollback_fixture(false).await;
let original_region_routes = fixture.original_region_routes.clone();
let procedure_ctx = TestingEnv::procedure_context();
stage_metadata(&mut fixture.context).await;
let mut procedure = RepartitionGroupProcedure {
state: Box::new(UpdateMetadata::ApplyStaging),
context: fixture.context,
};
procedure.rollback(&procedure_ctx).await.unwrap();
let table_route_value = procedure
.context
.get_table_route_value()
.await
.unwrap()
.into_inner();
let once = region_routes(
procedure.context.persistent_ctx.table_id,
&table_route_value,
)
.unwrap()
.clone();
procedure.rollback(&procedure_ctx).await.unwrap();
let table_route_value = procedure
.context
.get_table_route_value()
.await
.unwrap()
.into_inner();
let twice = region_routes(
procedure.context.persistent_ctx.table_id,
&table_route_value,
)
.unwrap()
.clone();
assert_eq!(once, original_region_routes);
assert_eq!(once, twice);
}
}

@@ -14,7 +14,6 @@
pub(crate) mod apply_staging_region;
pub(crate) mod exit_staging_region;
pub(crate) mod rollback_staging_region;
use std::any::Any;
use std::time::Instant;
@@ -34,9 +33,7 @@ use crate::procedure::repartition::group::{Context, State};
pub enum UpdateMetadata {
/// Applies the new partition expressions for staging regions.
ApplyStaging,
/// Rolls back the new partition expressions for staging regions.
RollbackStaging,
/// Exits the staging regions.
/// Exits the staging regions after the group finishes its forward path.
ExitStaging,
}
@@ -64,17 +61,6 @@ impl State for UpdateMetadata {
ctx.update_update_metadata_elapsed(timer.elapsed());
Ok((Box::new(EnterStagingRegion), Status::executing(false)))
}
UpdateMetadata::RollbackStaging => {
self.rollback_staging_regions(ctx).await?;
if let Err(err) = ctx.invalidate_table_cache().await {
warn!(
err;
"Failed to broadcast the invalidate table cache message during the rollback staging regions"
);
};
Ok((Box::new(RepartitionEnd), Status::executing(false)))
}
UpdateMetadata::ExitStaging => {
self.exit_staging_regions(ctx).await?;
if let Err(err) = ctx.invalidate_table_cache().await {

@@ -25,7 +25,7 @@ use crate::procedure::repartition::group::{Context, GroupId, region_routes};
use crate::procedure::repartition::plan::RegionDescriptor;
impl UpdateMetadata {
fn exit_staging_region_routes(
pub(crate) fn exit_staging_region_routes(
group_id: GroupId,
sources: &[RegionDescriptor],
targets: &[RegionDescriptor],

@@ -1,301 +0,0 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
use std::collections::HashMap;
use common_error::ext::BoxedError;
use common_meta::rpc::router::RegionRoute;
use common_telemetry::{error, info};
use snafu::{OptionExt, ResultExt};
use store_api::storage::RegionId;
use crate::error::{self, Result};
use crate::procedure::repartition::group::update_metadata::UpdateMetadata;
use crate::procedure::repartition::group::{Context, GroupId, region_routes};
use crate::procedure::repartition::plan::RegionDescriptor;
impl UpdateMetadata {
/// Rolls back the staging regions.
///
/// Abort:
/// - Source region not found.
/// - Target region not found.
fn rollback_staging_region_routes(
group_id: GroupId,
sources: &[RegionDescriptor],
original_target_routes: &[RegionRoute],
pending_deallocate_region_ids: &[RegionId],
current_region_routes: &[RegionRoute],
) -> Result<Vec<RegionRoute>> {
let mut region_routes = current_region_routes.to_vec();
let mut region_routes_map = region_routes
.iter_mut()
.map(|route| (route.region.id, route))
.collect::<HashMap<_, _>>();
for source in sources {
let region_route = region_routes_map.get_mut(&source.region_id).context(
error::RepartitionSourceRegionMissingSnafu {
group_id,
region_id: source.region_id,
},
)?;
// Clean leader staging state for source regions.
region_route.clear_leader_staging();
if pending_deallocate_region_ids.contains(&source.region_id) {
// Clean ignore all writes state for source regions if it's pending to be deallocated,
// which means the source region is merged into the target region.
region_route.clear_ignore_all_writes();
}
}
for target in original_target_routes {
let region_route = region_routes_map.get_mut(&target.region.id).context(
error::RepartitionTargetRegionMissingSnafu {
group_id,
region_id: target.region.id,
},
)?;
// Revert the partition expression and write route policy to the original value for the target region.
region_route.region.partition_expr = target.region.partition_expr.clone();
region_route.write_route_policy = target.write_route_policy;
// Clean leader staging state for target regions.
region_route.clear_leader_staging();
}
Ok(region_routes)
}
/// Rolls back the metadata for staging regions.
///
/// Abort:
/// - Table route is not physical.
/// - Source region not found.
/// - Target region not found.
/// - Failed to update the table route.
/// - Central region datanode table value not found.
pub(crate) async fn rollback_staging_regions(&self, ctx: &mut Context) -> Result<()> {
let table_id = ctx.persistent_ctx.table_id;
let group_id = ctx.persistent_ctx.group_id;
let current_table_route_value = ctx.get_table_route_value().await?;
let region_routes = region_routes(table_id, current_table_route_value.get_inner_ref())?;
// Safety: prepare result is set in [RepartitionStart] state.
let prepare_result = ctx.persistent_ctx.group_prepare_result.as_ref().unwrap();
let new_region_routes = Self::rollback_staging_region_routes(
group_id,
&ctx.persistent_ctx.sources,
&prepare_result.target_routes,
&ctx.persistent_ctx.pending_deallocate_region_ids,
region_routes,
)?;
let source_count = prepare_result.source_routes.len();
let target_count = prepare_result.target_routes.len();
info!(
"Rollback staging regions for repartition, table_id: {}, group_id: {}, sources: {}, targets: {}",
table_id, group_id, source_count, target_count
);
if let Err(err) = ctx
.update_table_route(&current_table_route_value, new_region_routes)
.await
{
error!(err; "Failed to update the table route during the updating metadata for repartition: {table_id}, group_id: {group_id}");
return Err(BoxedError::new(err)).context(error::RetryLaterWithSourceSnafu {
reason: format!(
"Failed to update the table route during the updating metadata for repartition: {table_id}, group_id: {group_id}"
),
});
};
Ok(())
}
}
#[cfg(test)]
mod tests {
use std::collections::HashSet;
use common_meta::peer::Peer;
use common_meta::rpc::router::{LeaderState, Region, RegionRoute};
use store_api::storage::RegionId;
use uuid::Uuid;
use crate::procedure::repartition::group::update_metadata::UpdateMetadata;
use crate::procedure::repartition::plan::RegionDescriptor;
use crate::procedure::repartition::test_util::range_expr;
fn new_region_route(
region_id: RegionId,
partition_expr: &str,
leader_state: Option<LeaderState>,
ignore_all_writes: bool,
) -> RegionRoute {
let mut route = RegionRoute {
region: Region {
id: region_id,
partition_expr: partition_expr.to_string(),
..Default::default()
},
leader_peer: Some(Peer::empty(1)),
leader_state,
..Default::default()
};
if ignore_all_writes {
route.set_ignore_all_writes();
}
route
}
fn original_target_routes(
region_routes: &[RegionRoute],
targets: &[RegionDescriptor],
) -> Vec<RegionRoute> {
let target_ids = targets
.iter()
.map(|target| target.region_id)
.collect::<HashSet<_>>();
region_routes
.iter()
.filter(|route| target_ids.contains(&route.region.id))
.cloned()
.collect()
}
#[test]
fn test_rollback_staging_region_routes_split_case() {
let group_id = Uuid::new_v4();
let table_id = 1024;
let original_region_routes = vec![
new_region_route(
RegionId::new(table_id, 1),
&range_expr("x", 0, 100).as_json_str().unwrap(),
None,
false,
),
new_region_route(
RegionId::new(table_id, 2),
&range_expr("x", 100, 200).as_json_str().unwrap(),
None,
false,
),
new_region_route(RegionId::new(table_id, 3), "", None, false),
];
let sources = vec![RegionDescriptor {
region_id: RegionId::new(table_id, 1),
partition_expr: range_expr("x", 0, 100),
}];
let targets = vec![
RegionDescriptor {
region_id: RegionId::new(table_id, 1),
partition_expr: range_expr("x", 0, 50),
},
RegionDescriptor {
region_id: RegionId::new(table_id, 3),
partition_expr: range_expr("x", 50, 100),
},
];
let applied_region_routes = UpdateMetadata::apply_staging_region_routes(
group_id,
&sources,
&targets,
&[],
&original_region_routes,
)
.unwrap();
let target_routes = original_target_routes(&original_region_routes, &targets);
let new_region_routes = UpdateMetadata::rollback_staging_region_routes(
group_id,
&sources,
&target_routes,
&[],
&applied_region_routes,
)
.unwrap();
assert_eq!(new_region_routes, original_region_routes);
}
#[test]
fn test_rollback_staging_region_routes_merge_case_is_idempotent() {
let group_id = Uuid::new_v4();
let table_id = 1024;
let original_region_routes = vec![
new_region_route(
RegionId::new(table_id, 1),
&range_expr("x", 0, 100).as_json_str().unwrap(),
None,
false,
),
new_region_route(
RegionId::new(table_id, 2),
&range_expr("x", 100, 200).as_json_str().unwrap(),
None,
false,
),
new_region_route(
RegionId::new(table_id, 3),
&range_expr("x", 200, 300).as_json_str().unwrap(),
None,
false,
),
];
let sources = vec![
RegionDescriptor {
region_id: RegionId::new(table_id, 1),
partition_expr: range_expr("x", 0, 100),
},
RegionDescriptor {
region_id: RegionId::new(table_id, 2),
partition_expr: range_expr("x", 100, 200),
},
];
let targets = vec![RegionDescriptor {
region_id: RegionId::new(table_id, 1),
partition_expr: range_expr("x", 0, 200),
}];
let target_routes = original_target_routes(&original_region_routes, &targets);
let applied_region_routes = UpdateMetadata::apply_staging_region_routes(
group_id,
&sources,
&targets,
&[RegionId::new(table_id, 2)],
&original_region_routes,
)
.unwrap();
let once = UpdateMetadata::rollback_staging_region_routes(
group_id,
&sources,
&target_routes,
&[RegionId::new(table_id, 2)],
&applied_region_routes,
)
.unwrap();
let twice = UpdateMetadata::rollback_staging_region_routes(
group_id,
&sources,
&target_routes,
&[RegionId::new(table_id, 2)],
&once,
)
.unwrap();
assert_eq!(once, original_region_routes);
assert_eq!(once, twice);
}
}

@@ -14,6 +14,7 @@
use std::cmp::Ordering;
use common_meta::rpc::router::RegionRoute;
use partition::expr::PartitionExpr;
use serde::{Deserialize, Serialize};
use store_api::storage::{RegionId, RegionNumber, TableId};
@@ -46,7 +47,7 @@ pub struct AllocationPlanEntry {
/// A plan entry for the dispatch phase after region allocation,
/// with concrete source and target region descriptors.
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)]
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
pub struct RepartitionPlanEntry {
/// The group id for this plan entry.
pub group_id: GroupId,
@@ -61,6 +62,9 @@ pub struct RepartitionPlanEntry {
/// For each `source_regions[k]`, the corresponding vector contains global
/// `target_regions` that overlap with it.
pub transition_map: Vec<Vec<usize>>,
/// Pre-staging target routes persisted for parent rollback and recovery.
#[serde(default)]
pub original_target_routes: Vec<RegionRoute>,
}
impl RepartitionPlanEntry {
@@ -138,6 +142,7 @@ pub fn convert_allocation_plan_to_repartition_plan(
allocated_region_ids,
pending_deallocate_region_ids: vec![],
transition_map: transition_map.clone(),
original_target_routes: vec![],
}
}
Ordering::Equal => {
@@ -157,6 +162,7 @@ pub fn convert_allocation_plan_to_repartition_plan(
allocated_region_ids: vec![],
pending_deallocate_region_ids: vec![],
transition_map: transition_map.clone(),
original_target_routes: vec![],
}
}
Ordering::Greater => {
@@ -184,6 +190,7 @@ pub fn convert_allocation_plan_to_repartition_plan(
allocated_region_ids: vec![],
pending_deallocate_region_ids,
transition_map: transition_map.clone(),
original_target_routes: vec![],
}
}
}

@@ -22,6 +22,8 @@ use snafu::{OptionExt, ResultExt, ensure};
use store_api::storage::{RegionId, RegionNumber, TableId};
use crate::error::{self, Result};
use crate::procedure::repartition::group::GroupId;
use crate::procedure::repartition::plan::RegionDescriptor;
/// Returns the `datanode_table_value`
///
@@ -118,14 +120,79 @@ pub fn merge_and_validate_region_wal_options(
Ok(new_region_wal_options)
}
/// Restores group staging metadata in-place for parent repartition rollback.
///
/// This helper lives in repartition utilities instead of the group subprocedure
/// because parent repartition owns crash recovery and rollback selection.
///
/// The function mutates `region_routes` in place to avoid rebuilding the route
/// vector for each selected plan. It restores:
/// - source-region leader staging flags,
/// - merge-source `ignore_all_writes` markers for pending-deallocate sources,
/// - target-region partition expressions,
/// - target-region write-route policies,
/// - target-region leader staging flags.
///
/// `original_target_routes` contains only pre-existing target routes.
/// Newly allocated targets are removed by parent rollback instead of being
/// restored here.
pub fn rollback_group_metadata_routes(
group_id: GroupId,
source_regions: &[RegionDescriptor],
original_target_routes: &[RegionRoute],
allocated_region_ids: &[RegionId],
pending_deallocate_region_ids: &[RegionId],
region_routes_map: &mut HashMap<RegionId, &mut RegionRoute>,
) -> Result<()> {
for source in source_regions {
let region_route = region_routes_map.get_mut(&source.region_id).context(
error::RepartitionSourceRegionMissingSnafu {
group_id,
region_id: source.region_id,
},
)?;
region_route.clear_leader_staging();
if pending_deallocate_region_ids.contains(&source.region_id) {
region_route.clear_ignore_all_writes();
}
}
for target in original_target_routes {
let Some(region_route) = region_routes_map.get_mut(&target.region.id) else {
// Ignores newly allocated region routes that do not exist in the current region routes.
// They may have already been deleted (to ensure idempotency).
if allocated_region_ids.contains(&target.region.id) {
continue;
}
return error::RepartitionTargetRegionMissingSnafu {
group_id,
region_id: target.region.id,
}
.fail();
};
region_route.region.partition_expr = target.region.partition_expr.clone();
region_route.write_route_policy = target.write_route_policy;
region_route.clear_leader_staging();
}
Ok(())
}
#[cfg(test)]
mod tests {
use std::collections::HashSet;
use common_meta::peer::Peer;
use common_meta::rpc::router::{Region, RegionRoute};
use common_meta::rpc::router::{LeaderState, Region, RegionRoute};
use common_wal::options::{KafkaWalOptions, WalOptions};
use store_api::storage::RegionId;
use uuid::Uuid;
use super::*;
use crate::procedure::repartition::group::update_metadata::UpdateMetadata;
use crate::procedure::repartition::plan::RegionDescriptor;
use crate::procedure::repartition::test_util::range_expr;
/// Helper function to create a Kafka WAL option string from a topic name.
fn kafka_wal_option(topic: &str) -> String {
@@ -149,6 +216,45 @@ mod tests {
}
}
fn new_staged_region_route(
region_id: RegionId,
partition_expr: &str,
leader_state: Option<LeaderState>,
ignore_all_writes: bool,
) -> RegionRoute {
let mut route = RegionRoute {
region: Region {
id: region_id,
partition_expr: partition_expr.to_string(),
..Default::default()
},
leader_peer: Some(Peer::empty(1)),
leader_state,
..Default::default()
};
if ignore_all_writes {
route.set_ignore_all_writes();
}
route
}
fn original_target_routes(
region_routes: &[RegionRoute],
targets: &[RegionDescriptor],
) -> Vec<RegionRoute> {
let target_ids = targets
.iter()
.map(|target| target.region_id)
.collect::<HashSet<_>>();
region_routes
.iter()
.filter(|route| target_ids.contains(&route.region.id))
.cloned()
.collect()
}
#[test]
fn test_merge_and_validate_region_wal_options_success() {
let table_id = 1;
@@ -254,4 +360,141 @@ mod tests {
assert!(error_msg.contains("Mismatch"));
assert!(error_msg.contains(&table_id.to_string()));
}
#[test]
fn test_rollback_group_metadata_routes_split_case() {
let group_id = Uuid::new_v4();
let table_id = 1024;
let original_region_routes = vec![
new_staged_region_route(
RegionId::new(table_id, 1),
&range_expr("x", 0, 100).as_json_str().unwrap(),
None,
false,
),
new_staged_region_route(
RegionId::new(table_id, 2),
&range_expr("x", 100, 200).as_json_str().unwrap(),
None,
false,
),
new_staged_region_route(RegionId::new(table_id, 3), "", None, false),
];
let sources = vec![RegionDescriptor {
region_id: RegionId::new(table_id, 1),
partition_expr: range_expr("x", 0, 100),
}];
let targets = vec![
RegionDescriptor {
region_id: RegionId::new(table_id, 1),
partition_expr: range_expr("x", 0, 50),
},
RegionDescriptor {
region_id: RegionId::new(table_id, 3),
partition_expr: range_expr("x", 50, 100),
},
];
let mut applied_region_routes = UpdateMetadata::apply_staging_region_routes(
group_id,
&sources,
&targets,
&[],
&original_region_routes,
)
.unwrap();
let target_routes = original_target_routes(&original_region_routes, &targets);
rollback_group_metadata_routes(
group_id,
&sources,
&target_routes,
&[],
&[],
&mut applied_region_routes
.iter_mut()
.map(|route| (route.region.id, route))
.collect(),
)
.unwrap();
assert_eq!(applied_region_routes, original_region_routes);
}
#[test]
fn test_rollback_group_metadata_routes_merge_case_is_idempotent() {
let group_id = Uuid::new_v4();
let table_id = 1024;
let original_region_routes = vec![
new_staged_region_route(
RegionId::new(table_id, 1),
&range_expr("x", 0, 100).as_json_str().unwrap(),
None,
false,
),
new_staged_region_route(
RegionId::new(table_id, 2),
&range_expr("x", 100, 200).as_json_str().unwrap(),
None,
false,
),
new_staged_region_route(
RegionId::new(table_id, 3),
&range_expr("x", 200, 300).as_json_str().unwrap(),
None,
false,
),
];
let sources = vec![
RegionDescriptor {
region_id: RegionId::new(table_id, 1),
partition_expr: range_expr("x", 0, 100),
},
RegionDescriptor {
region_id: RegionId::new(table_id, 2),
partition_expr: range_expr("x", 100, 200),
},
];
let targets = vec![RegionDescriptor {
region_id: RegionId::new(table_id, 1),
partition_expr: range_expr("x", 0, 200),
}];
let target_routes = original_target_routes(&original_region_routes, &targets);
let mut once = UpdateMetadata::apply_staging_region_routes(
group_id,
&sources,
&targets,
&[RegionId::new(table_id, 2)],
&original_region_routes,
)
.unwrap();
rollback_group_metadata_routes(
group_id,
&sources,
&target_routes,
&[],
&[RegionId::new(table_id, 2)],
&mut once
.iter_mut()
.map(|route| (route.region.id, route))
.collect(),
)
.unwrap();
let mut twice = once.clone();
rollback_group_metadata_routes(
group_id,
&sources,
&target_routes,
&[],
&[RegionId::new(table_id, 2)],
&mut twice
.iter_mut()
.map(|route| (route.region.id, route))
.collect(),
)
.unwrap();
assert_eq!(once, original_region_routes);
assert_eq!(once, twice);
}
}
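
The in-place rollback helper and its idempotency tests above can be illustrated with a minimal, standalone sketch. It is not the project's code: `Route`, `RollbackError`, `rollback_routes`, `route`, and `index` are hypothetical stand-ins that assume a route carries only an id, a partition expression, a staging flag, and an ignore-all-writes flag (the write-route policy is left out for brevity).

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-ins for `RegionId` and `RegionRoute`.
type RegionId = u64;

#[derive(Debug, Clone, PartialEq)]
struct Route {
    id: RegionId,
    partition_expr: String,
    staging: bool,
    ignore_all_writes: bool,
}

#[derive(Debug, PartialEq)]
enum RollbackError {
    SourceMissing(RegionId),
    TargetMissing(RegionId),
}

// Builds an unstaged route (illustrative helper).
fn route(id: RegionId, expr: &str) -> Route {
    Route { id, partition_expr: expr.to_string(), staging: false, ignore_all_writes: false }
}

// Indexes routes by id so they can be mutated in place.
fn index(routes: &mut [Route]) -> HashMap<RegionId, &mut Route> {
    routes.iter_mut().map(|r| (r.id, r)).collect()
}

// Restores staged routes in place. A second call leaves the routes unchanged,
// because every step writes a fixed value instead of toggling state.
fn rollback_routes(
    sources: &[RegionId],
    original_targets: &[Route],
    allocated: &[RegionId],
    pending_deallocate: &[RegionId],
    routes: &mut HashMap<RegionId, &mut Route>,
) -> Result<(), RollbackError> {
    for id in sources {
        let r = routes.get_mut(id).ok_or(RollbackError::SourceMissing(*id))?;
        r.staging = false;
        if pending_deallocate.contains(id) {
            // Merge sources stop ignoring writes once the merge is undone.
            r.ignore_all_writes = false;
        }
    }
    for target in original_targets {
        let Some(r) = routes.get_mut(&target.id) else {
            // A newly allocated target may already have been removed.
            if allocated.contains(&target.id) {
                continue;
            }
            return Err(RollbackError::TargetMissing(target.id));
        };
        r.partition_expr = target.partition_expr.clone();
        r.staging = false;
    }
    Ok(())
}

fn main() {
    let original = vec![route(1, "0..100"), route(2, "100..200")];
    // Staged state for merging regions 1 and 2 into region 1.
    let mut staged = original.clone();
    staged[0].partition_expr = "0..200".to_string();
    staged[0].staging = true;
    staged[1].staging = true;
    staged[1].ignore_all_writes = true;

    let targets = vec![original[0].clone()];
    rollback_routes(&[1, 2], &targets, &[], &[2], &mut index(&mut staged)).unwrap();
    assert_eq!(staged, original);
    // Rolling back again is a no-op.
    rollback_routes(&[1, 2], &targets, &[], &[2], &mut index(&mut staged)).unwrap();
    assert_eq!(staged, original);
}
```

Passing a map of mutable references, as the real helper does, lets a caller restore several groups against one route vector without cloning it per group.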

@@ -219,7 +219,7 @@ async fn test_on_datanode_create_regions() {
}
});
let status = procedure.on_datanode_create_regions().await.unwrap();
let status = procedure.on_datanode_create_regions(false).await.unwrap();
assert!(matches!(
status,
Status::Executing {

@@ -20,7 +20,7 @@ use common_meta::key::TableMetadataManagerRef;
use common_meta::key::table_route::TableRouteValue;
use common_meta::region_keeper::MemoryRegionKeeperRef;
use common_meta::rpc::router::RegionRoute;
use common_telemetry::warn;
use common_telemetry::{info, warn};
use snafu::ResultExt;
use store_api::region_engine::RegionRole;
use store_api::storage::{RegionId, TableId};
@@ -63,13 +63,9 @@ fn renew_region_lease_via_region_route(
if let Some(leader) = &region_route.leader_peer
&& leader.id == datanode_id
{
let region_role = if region_route.is_leader_downgrading() {
RegionRole::DowngradingLeader
} else {
RegionRole::Leader
};
return Some((region_id, region_role));
return region_route
.leader_region_role()
.map(|region_role| (region_id, region_role));
}
// If it's a follower region on this datanode.
@@ -81,11 +77,28 @@ fn renew_region_lease_via_region_route(
return Some((region_id, RegionRole::Follower));
}
warn!(
"Denied to renew region lease for datanode: {datanode_id}, region_id: {region_id}, region_routes: {:?}",
region_route
);
// The region doesn't belong to this datanode.
None
}
fn renew_region_lease_via_operating_regions(
operating_regions: &HashMap<RegionId, RegionRole>,
datanode_id: DatanodeId,
region_id: RegionId,
reported_role: RegionRole,
) -> Option<RegionLeaseInfo> {
// `operating_regions` is filtered by the current datanode in `collect_metadata`,
// so looking up by `region_id` is sufficient here.
if let Some(role) = operating_regions.get(&region_id) {
let region_lease_info = RegionLeaseInfo::operating(region_id, *role);
if *role != reported_role {
info!(
"The region {} on datanode {} is operating with role {:?}, but reported as {:?}",
region_id, datanode_id, role, reported_role
);
}
return Some(region_lease_info);
}
None
}
@@ -147,49 +160,51 @@ impl RegionLeaseKeeper {
}
/// Returns [None] if:
/// - The region doesn't belong to the datanode.
/// - The region doesn't belong to the datanode in metadata or operating regions.
/// - The region belongs to a logical table.
fn renew_region_lease(
&self,
table_metadata: &HashMap<TableId, TableRouteValue>,
operating_regions: &HashSet<RegionId>,
operating_regions: &HashMap<RegionId, RegionRole>,
datanode_id: DatanodeId,
region_id: RegionId,
role: RegionRole,
reported_role: RegionRole,
) -> Option<RegionLeaseInfo> {
if operating_regions.contains(&region_id) {
let region_lease_info = RegionLeaseInfo::operating(region_id, role);
// First try to renew via region route
if let Some(table_route) = table_metadata.get(&region_id.table_id())
&& let Ok(Some(region_route)) = table_route.region_route(region_id)
&& let Some(region_lease) =
renew_region_lease_via_region_route(&region_route, datanode_id, region_id)
{
return Some(RegionLeaseInfo::from(region_lease));
}
// Then try to renew via operating regions, which covers the opening region without region route in metadata.
if let Some(region_lease_info) = renew_region_lease_via_operating_regions(
operating_regions,
datanode_id,
region_id,
reported_role,
) {
return Some(region_lease_info);
}
if let Some(table_route) = table_metadata.get(&region_id.table_id()) {
if let Ok(Some(region_route)) = table_route.region_route(region_id) {
return renew_region_lease_via_region_route(&region_route, datanode_id, region_id)
.map(RegionLeaseInfo::from);
} else {
warn!(
"Denied to renew region lease for datanode: {datanode_id}, region_id: {region_id}, region route is not found in table({})",
region_id.table_id()
);
}
} else {
warn!(
"Denied to renew region lease for datanode: {datanode_id}, region_id: {region_id}, table({}) is not found",
region_id.table_id()
);
}
warn!(
"Denied to renew region lease for datanode: {datanode_id}, region_id: {region_id}, no matching metadata or operating region found",
);
None
}
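
The renewal order in the new `renew_region_lease` (route metadata first, operating regions as a fallback) can be sketched in isolation. This is a simplified illustration under stated assumptions: `Role`, `RouteEntry`, `Lease`, and `renew_lease` are hypothetical names, follower peers are omitted, and the role implied by the route is stored directly in `leader_role` rather than derived from a leader state.

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-ins for the real id and role types.
type RegionId = u64;
type DatanodeId = u64;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Role {
    Leader,
    Follower,
}

// What the route metadata says about a region (assumed shape).
struct RouteEntry {
    leader: DatanodeId,
    leader_role: Role,
}

#[derive(Debug, PartialEq, Eq)]
struct Lease {
    region_id: RegionId,
    role: Role,
    // True when the lease came from the operating-region registry.
    operating: bool,
}

fn renew_lease(
    routes: &HashMap<RegionId, RouteEntry>,
    operating: &HashMap<RegionId, Role>,
    datanode_id: DatanodeId,
    region_id: RegionId,
    reported_role: Role,
) -> Option<Lease> {
    // 1. Route metadata wins: the granted role comes from the route,
    //    not from what the datanode reported.
    if let Some(entry) = routes.get(&region_id) {
        if entry.leader == datanode_id {
            return Some(Lease { region_id, role: entry.leader_role, operating: false });
        }
    }
    // 2. Fall back to operating regions, which covers a region that is
    //    being opened and has no route in the metadata yet.
    if let Some(role) = operating.get(&region_id) {
        if *role != reported_role {
            eprintln!("region {region_id}: operating as {role:?}, reported {reported_role:?}");
        }
        return Some(Lease { region_id, role: *role, operating: true });
    }
    // 3. Neither source knows the region for this datanode.
    None
}

fn main() {
    let routes = HashMap::from([(1, RouteEntry { leader: 7, leader_role: Role::Leader })]);
    let operating = HashMap::from([(1, Role::Follower), (2, Role::Leader)]);

    // Metadata beats the registry role for region 1.
    let lease = renew_lease(&routes, &operating, 7, 1, Role::Follower);
    assert_eq!(lease, Some(Lease { region_id: 1, role: Role::Leader, operating: false }));
    // Region 2 has no route, so the registry role is used.
    let lease = renew_lease(&routes, &operating, 7, 2, Role::Follower);
    assert_eq!(lease, Some(Lease { region_id: 2, role: Role::Leader, operating: true }));
}
```

In both branches the reported role only feeds a diagnostic; the granted role always comes from the metadata or the registry, which matches what the added tests in this diff assert.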
async fn collect_metadata(
&self,
datanode_id: DatanodeId,
mut region_ids: HashSet<RegionId>,
) -> Result<(HashMap<TableId, TableRouteValue>, HashSet<RegionId>)> {
// Filters out operating region first, improves the cache hit rate(reduce expensive remote fetches).
region_ids: HashSet<RegionId>,
) -> Result<(
HashMap<TableId, TableRouteValue>,
HashMap<RegionId, RegionRole>,
)> {
let operating_regions = self
.memory_region_keeper
.extract_operating_regions(datanode_id, &mut region_ids);
.extract_operating_region_roles(datanode_id, &region_ids);
let table_ids = region_ids
.into_iter()
.map(|region_id| region_id.table_id())
@@ -222,13 +237,13 @@ impl RegionLeaseKeeper {
let mut renewed = HashMap::new();
let mut non_exists = HashSet::new();
for &(region, role) in regions {
for &(region, reported_role) in regions {
match self.renew_region_lease(
&table_metadata,
&operating_regions,
datanode_id,
region,
role,
reported_role,
) {
Some(region_lease_info) => {
renewed.insert(region_lease_info.region_id, region_lease_info);
@@ -313,6 +328,12 @@ mod tests {
renew_region_lease_via_region_route(&region_route, leader_peer_id, region_id),
Some((region_id, RegionRole::DowngradingLeader))
);
region_route.leader_state = Some(LeaderState::Staging);
assert_eq!(
renew_region_lease_via_region_route(&region_route, leader_peer_id, region_id),
Some((region_id, RegionRole::StagingLeader))
);
}
#[tokio::test]
@@ -368,12 +389,16 @@ mod tests {
let opening_region_id = RegionId::new(1025, 1);
let _guard = keeper
.memory_region_keeper
.register(leader_peer_id, opening_region_id)
.register_with_role(leader_peer_id, opening_region_id, RegionRole::Leader)
.unwrap();
let another_opening_region_id = RegionId::new(1025, 2);
let _guard2 = keeper
.memory_region_keeper
.register(follower_peer_id, another_opening_region_id)
.register_with_role(
follower_peer_id,
another_opening_region_id,
RegionRole::Follower,
)
.unwrap();
let (metadata, regions) = keeper
@@ -387,8 +412,10 @@ mod tests {
metadata.keys().cloned().collect::<Vec<_>>(),
vec![region_id.table_id()]
);
assert!(regions.contains(&opening_region_id));
assert_eq!(regions.len(), 1);
assert_eq!(
regions,
HashMap::from([(opening_region_id, RegionRole::Leader)])
);
}
#[tokio::test]
@@ -473,17 +500,17 @@ mod tests {
let opening_region_id = RegionId::new(2048, 1);
let _guard = keeper
.memory_region_keeper
.register(leader_peer_id, opening_region_id)
.register_with_role(leader_peer_id, opening_region_id, RegionRole::Leader)
.unwrap();
// The opening region on the datanode.
// NOTES: The procedure lock will ensure only one opening leader.
for role in [RegionRole::Leader, RegionRole::Follower] {
for reported_role in [RegionRole::Leader, RegionRole::Follower] {
let RenewRegionLeasesResponse {
non_exists,
renewed,
} = keeper
.renew_region_leases(leader_peer_id, &[(opening_region_id, role)])
.renew_region_leases(leader_peer_id, &[(opening_region_id, reported_role)])
.await
.unwrap();
@@ -492,7 +519,7 @@ mod tests {
renewed,
HashMap::from([(
opening_region_id,
RegionLeaseInfo::operating(opening_region_id, role)
RegionLeaseInfo::operating(opening_region_id, RegionRole::Leader)
)])
);
}
@@ -581,4 +608,213 @@ mod tests {
);
}
}
#[tokio::test]
async fn test_renew_region_leases_reported_staging_expected_leader() {
let table_id = 1024;
let table_info: TableInfo = new_test_table_info(table_id);
let region_id = RegionId::new(table_id, 1);
let leader_peer_id = 1024;
let region_route = RegionRouteBuilder::default()
.region(Region::new_test(region_id))
.leader_peer(Peer::empty(leader_peer_id))
.build()
.unwrap();
let keeper = new_test_keeper();
let table_metadata_manager = keeper.table_metadata_manager();
table_metadata_manager
.create_table_metadata(
table_info,
TableRouteValue::physical(vec![region_route]),
HashMap::default(),
)
.await
.unwrap();
let RenewRegionLeasesResponse {
non_exists,
renewed,
} = keeper
.renew_region_leases(leader_peer_id, &[(region_id, RegionRole::StagingLeader)])
.await
.unwrap();
assert!(non_exists.is_empty());
assert_eq!(
renewed,
HashMap::from([(
region_id,
RegionLeaseInfo::from((region_id, RegionRole::Leader))
)])
);
}
#[tokio::test]
async fn test_renew_region_leases_reported_staging_expected_staging() {
let table_id = 1024;
let table_info: TableInfo = new_test_table_info(table_id);
let region_id = RegionId::new(table_id, 1);
let leader_peer_id = 1024;
let region_route = RegionRouteBuilder::default()
.region(Region::new_test(region_id))
.leader_peer(Peer::empty(leader_peer_id))
.leader_state(LeaderState::Staging)
.build()
.unwrap();
let keeper = new_test_keeper();
let table_metadata_manager = keeper.table_metadata_manager();
table_metadata_manager
.create_table_metadata(
table_info,
TableRouteValue::physical(vec![region_route]),
HashMap::default(),
)
.await
.unwrap();
let RenewRegionLeasesResponse {
non_exists,
renewed,
} = keeper
.renew_region_leases(leader_peer_id, &[(region_id, RegionRole::StagingLeader)])
.await
.unwrap();
assert!(non_exists.is_empty());
assert_eq!(
renewed,
HashMap::from([(
region_id,
RegionLeaseInfo::from((region_id, RegionRole::StagingLeader))
)])
);
}
#[tokio::test]
async fn test_renew_region_leases_metadata_role_beats_keeper_role() {
let table_id = 2048;
let table_info: TableInfo = new_test_table_info(table_id);
let datanode_id = 1024;
let region_id = RegionId::new(table_id, 1);
let region_route = RegionRouteBuilder::default()
.region(Region::new_test(region_id))
.leader_peer(Peer::empty(datanode_id))
.build()
.unwrap();
let keeper = new_test_keeper();
let table_metadata_manager = keeper.table_metadata_manager();
table_metadata_manager
.create_table_metadata(
table_info,
TableRouteValue::physical(vec![region_route]),
HashMap::default(),
)
.await
.unwrap();
let _guard = keeper
.memory_region_keeper
.register_with_role(datanode_id, region_id, RegionRole::Follower)
.unwrap();
let RenewRegionLeasesResponse {
non_exists,
renewed,
} = keeper
.renew_region_leases(datanode_id, &[(region_id, RegionRole::Follower)])
.await
.unwrap();
assert!(non_exists.is_empty());
assert_eq!(
renewed,
HashMap::from([(
region_id,
RegionLeaseInfo::from((region_id, RegionRole::Leader))
)])
);
}
#[tokio::test]
async fn test_renew_region_leases_missing_route_falls_back_to_keeper_role() {
let table_id = 2048;
let table_info: TableInfo = new_test_table_info(table_id);
let datanode_id = 1024;
let region_id = RegionId::new(table_id, 1);
let another_region_id = RegionId::new(table_id, 2);
let region_route = RegionRouteBuilder::default()
.region(Region::new_test(another_region_id))
.leader_peer(Peer::empty(datanode_id))
.build()
.unwrap();
let keeper = new_test_keeper();
let table_metadata_manager = keeper.table_metadata_manager();
table_metadata_manager
.create_table_metadata(
table_info,
TableRouteValue::physical(vec![region_route]),
HashMap::default(),
)
.await
.unwrap();
let _guard = keeper
.memory_region_keeper
.register_with_role(datanode_id, region_id, RegionRole::DowngradingLeader)
.unwrap();
let RenewRegionLeasesResponse {
non_exists,
renewed,
} = keeper
.renew_region_leases(datanode_id, &[(region_id, RegionRole::StagingLeader)])
.await
.unwrap();
assert!(non_exists.is_empty());
assert_eq!(
renewed,
HashMap::from([(
region_id,
RegionLeaseInfo::operating(region_id, RegionRole::DowngradingLeader)
)])
);
}
#[tokio::test]
async fn test_renew_region_leases_operating_region_uses_keeper_role() {
let keeper = new_test_keeper();
let datanode_id = 1024;
let region_id = RegionId::new(2048, 1);
let _guard = keeper
.memory_region_keeper
.register_with_role(datanode_id, region_id, RegionRole::DowngradingLeader)
.unwrap();
let RenewRegionLeasesResponse {
non_exists,
renewed,
} = keeper
.renew_region_leases(datanode_id, &[(region_id, RegionRole::StagingLeader)])
.await
.unwrap();
assert!(non_exists.is_empty());
assert_eq!(
renewed,
HashMap::from([(
region_id,
RegionLeaseInfo::operating(region_id, RegionRole::DowngradingLeader)
)])
);
}
}
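
The lease tests above suggest a precedence order when a datanode renews a region lease: a role derived from the table route wins, a role registered in the in-memory keeper is used only when no route exists, and the role reported by the datanode does not change the outcome in these cases. The following is a minimal sketch of that precedence only. `Role`, `Lease`, and `resolve_lease` are hypothetical names, not the actual API, and the real implementation may handle cases these tests do not exercise.

```rust
/// Hypothetical stand-in for `RegionRole`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Role {
    Leader,
    Follower,
    StagingLeader,
    DowngradingLeader,
}

/// Hypothetical stand-in for `RegionLeaseInfo`: either derived from the
/// table route, or marked as operating (taken from the in-memory keeper).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Lease {
    FromRoute(Role),
    Operating(Role),
}

/// Assumed precedence: route metadata first, keeper role second.
/// `None` models a region that is reported as non-existent.
fn resolve_lease(route_role: Option<Role>, keeper_role: Option<Role>) -> Option<Lease> {
    match (route_role, keeper_role) {
        (Some(role), _) => Some(Lease::FromRoute(role)),
        (None, Some(role)) => Some(Lease::Operating(role)),
        (None, None) => None,
    }
}

fn main() {
    // Route says Leader, keeper says Follower: the route wins.
    println!("{:?}", resolve_lease(Some(Role::Leader), Some(Role::Follower)));
    // Route in staging state, nothing registered in the keeper.
    println!("{:?}", resolve_lease(Some(Role::StagingLeader), None));
    // No route for the region: fall back to the keeper role.
    println!("{:?}", resolve_lease(None, Some(Role::DowngradingLeader)));
}
```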

@@ -33,7 +33,7 @@ itertools.workspace = true
lazy_static = "1.4"
mito-codec.workspace = true
mito2.workspace = true
moka.workspace = true
moka = { workspace = true, features = ["future"] }
object-store.workspace = true
prometheus.workspace = true
serde.workspace = true

@@ -378,7 +378,7 @@ impl ManifestCache {
warn!(e; "Failed to remove empty root dir {}", dir.display());
return Err(e);
} else {
warn!("Empty root dir not found before removal {}", dir.display());
info!("Empty root dir not found before removal {}", dir.display());
}
} else {
info!(

@@ -24,12 +24,13 @@ mod twcs;
mod window;
use std::collections::HashMap;
use std::sync::Arc;
use std::sync::{Arc, Mutex};
use std::time::Instant;
use api::v1::region::compact_request;
use api::v1::region::compact_request::Options;
use common_base::Plugins;
use common_base::cancellation::CancellationHandle;
use common_memory_manager::OnExhaustedPolicy;
use common_meta::key::SchemaMetadataManagerRef;
use common_telemetry::{debug, error, info, warn};
@@ -53,9 +54,9 @@ use crate::compaction::picker::{CompactionTask, PickerOutput, new_picker};
use crate::compaction::task::CompactionTaskImpl;
use crate::config::MitoConfig;
use crate::error::{
CompactRegionSnafu, Error, GetSchemaMetadataSnafu, ManualCompactionOverrideSnafu,
RegionClosedSnafu, RegionDroppedSnafu, RegionTruncatedSnafu, RemoteCompactionSnafu, Result,
TimeRangePredicateOverflowSnafu, TimeoutSnafu,
CompactRegionSnafu, CompactionCancelledSnafu, Error, GetSchemaMetadataSnafu,
ManualCompactionOverrideSnafu, RegionClosedSnafu, RegionDroppedSnafu, RegionTruncatedSnafu,
RemoteCompactionSnafu, Result, TimeRangePredicateOverflowSnafu, TimeoutSnafu,
};
use crate::metrics::{COMPACTION_STAGE_ELAPSED, INFLIGHT_COMPACTION_COUNT};
use crate::read::BoxedRecordBatchStream;
@@ -186,7 +187,7 @@ impl CompactionScheduler {
}
// The region can compact directly.
let mut status: CompactionStatus =
let mut status =
CompactionStatus::new(region_id, version_control.clone(), access_layer.clone());
let request = status.new_compaction_request(
self.request_sender.clone(),
@@ -199,17 +200,25 @@ impl CompactionScheduler {
max_parallelism,
);
let result = self
let result = match self
.schedule_compaction_request(request, compact_options)
.await;
if matches!(result, Ok(true)) {
// Only if the compaction request is scheduled successfully,
// we insert the region into the status map.
self.region_status.insert(region_id, status);
}
.await
{
Ok(Some(active_compaction)) => {
// Publish CompactionStatus only after a task has been accepted by the scheduler.
// This avoids exposing a half-initialized region status that could collect pending
// DDL/compaction state even though no compaction is actually running.
status.active_compaction = Some(active_compaction);
self.region_status.insert(region_id, status);
Ok(())
}
Ok(None) => Ok(()),
Err(e) => Err(e),
};
self.listener.on_compaction_scheduled(region_id);
result.map(|_| ())
result
}
// Handle pending manual compaction request for the region.
@@ -251,14 +260,16 @@ impl CompactionScheduler {
};
match self.schedule_compaction_request(request, options).await {
Ok(true) => {
Ok(Some(active_compaction)) => {
let status = self.region_status.get_mut(&region_id).unwrap();
status.active_compaction = Some(active_compaction);
debug!(
"Successfully scheduled manual compaction for region id: {}",
region_id
);
true
}
Ok(false) => {
Ok(None) => {
// We still need to handle the pending DDL requests.
// So we can't return early here.
false
@@ -278,6 +289,11 @@ impl CompactionScheduler {
manifest_ctx: &ManifestContextRef,
schema_metadata_manager: SchemaMetadataManagerRef,
) -> Vec<SenderDdlRequest> {
let Some(status) = self.region_status.get_mut(&region_id) else {
return Vec::new();
};
status.clear_running_task();
// If there is a pending compaction request, handle it first
// and defer returning the pending DDL requests to the caller.
if self
@@ -297,7 +313,6 @@ impl CompactionScheduler {
return Vec::new();
};
// Notify all waiters that compaction is finished.
for waiter in std::mem::take(&mut status.waiters) {
waiter.send(Ok(0));
}
@@ -331,13 +346,17 @@ impl CompactionScheduler {
)
.await
{
Ok(true) => {
Ok(Some(active_compaction)) => {
self.region_status
.get_mut(&region_id)
.unwrap()
.active_compaction = Some(active_compaction);
debug!(
"Successfully scheduled next compaction for region id: {}",
region_id
);
}
Ok(false) => {
Ok(None) => {
// No further compaction tasks can be scheduled; clean up the `CompactionStatus` for this region.
// All DDL requests and pending compaction requests have already been processed.
// Safe to remove the region from status tracking.
@@ -352,6 +371,14 @@ impl CompactionScheduler {
Vec::new()
}
/// Notifies the scheduler that the compaction job is cancelled cooperatively.
pub(crate) async fn on_compaction_cancelled(
&mut self,
region_id: RegionId,
) -> Vec<SenderDdlRequest> {
self.remove_region_on_cancel(region_id)
}
/// Notifies the scheduler that the compaction job has failed.
pub(crate) fn on_compaction_failed(&mut self, region_id: RegionId, err: Arc<Error>) {
error!(err; "Region {} failed to compact, cancel all pending tasks", region_id);
@@ -406,20 +433,23 @@ impl CompactionScheduler {
has_pending
}
/// Returns true if the region is compacting.
pub(crate) fn is_compacting(&self, region_id: RegionId) -> bool {
self.region_status.contains_key(&region_id)
pub(crate) fn request_cancel(&mut self, region_id: RegionId) -> RequestCancelResult {
let Some(status) = self.region_status.get_mut(&region_id) else {
return RequestCancelResult::NotRunning;
};
status.request_cancel()
}
/// Schedules a compaction request.
///
/// Returns true if the compaction request is scheduled successfully.
/// Returns false if no compaction task can be scheduled for this region.
/// Returns the active compaction state if the request is scheduled successfully.
/// Returns `None` if no compaction task can be scheduled for this region.
async fn schedule_compaction_request(
&mut self,
request: CompactionRequest,
options: compact_request::Options,
) -> Result<bool> {
) -> Result<Option<ActiveCompaction>> {
let region_id = request.region_id();
let (dynamic_compaction_opts, ttl) = find_dynamic_options(
region_id.table_id(),
@@ -492,7 +522,7 @@ impl CompactionScheduler {
for waiter in waiters {
waiter.send(Ok(0));
}
return Ok(false);
return Ok(None);
};
// If specified to run compaction remotely, we schedule the compaction job remotely.
@@ -523,7 +553,7 @@ impl CompactionScheduler {
job_id, region_id
);
INFLIGHT_COMPACTION_COUNT.inc();
return Ok(true);
return Ok(Some(ActiveCompaction::Remote));
}
Err(e) => {
if !dynamic_compaction_opts.fallback_to_local() {
@@ -555,21 +585,25 @@ impl CompactionScheduler {
// Create a local compaction task.
let estimated_bytes = estimate_compaction_bytes(&picker_output);
let cancel_handle = Arc::new(CancellationHandle::default());
let state = LocalCompactionState::new(cancel_handle.clone());
let local_compaction_task = Box::new(CompactionTaskImpl {
state: state.clone(),
request_sender,
waiters,
start_time,
listener,
picker_output,
compaction_region,
compactor: Arc::new(DefaultCompactor {}),
compactor: Arc::new(DefaultCompactor::with_cancel_handle(cancel_handle.clone())),
memory_manager: self.memory_manager.clone(),
memory_policy: self.memory_policy,
estimated_memory_bytes: estimated_bytes,
});
self.submit_compaction_task(local_compaction_task, region_id)
.map(|_| true)
.map(|_| Some(ActiveCompaction::Local { state }))
}
fn submit_compaction_task(
@@ -597,6 +631,77 @@ impl CompactionScheduler {
// Notifies all pending tasks.
status.on_failure(err);
}
fn remove_region_on_cancel(&mut self, region_id: RegionId) -> Vec<SenderDdlRequest> {
let Some(status) = self.region_status.remove(&region_id) else {
return Vec::new();
};
status.on_cancel()
}
}
#[derive(Debug, Clone)]
pub(crate) struct LocalCompactionState {
cancel_handle: Arc<CancellationHandle>,
commit_started: Arc<Mutex<bool>>,
}
#[derive(Debug)]
enum ActiveCompaction {
Local { state: LocalCompactionState },
Remote,
}
impl LocalCompactionState {
fn new(cancel_handle: Arc<CancellationHandle>) -> Self {
Self {
cancel_handle,
commit_started: Arc::new(Mutex::new(false)),
}
}
/// Returns the cancellation handle for this compaction task.
pub(crate) fn cancel_handle(&self) -> Arc<CancellationHandle> {
self.cancel_handle.clone()
}
/// Marks the compaction task as having started its commit phase,
/// the final stage in which it updates the region version and manifest.
/// Cancellation requests are rejected once this method has returned true.
///
/// Returns false if cancellation was already requested (the task must not commit), true otherwise.
pub(crate) fn mark_commit_started(&self) -> bool {
let mut commit_started = self.commit_started.lock().unwrap();
if self.cancel_handle.is_cancelled() {
return false;
}
*commit_started = true;
true
}
/// Request cancellation for this compaction task.
pub(crate) fn request_cancel(&self) -> RequestCancelResult {
// The cancel handle must be checked and set under the `commit_started` lock to avoid a race between cancellation and commit.
let commit_started = self.commit_started.lock().unwrap();
if *commit_started {
return RequestCancelResult::TooLateToCancel;
}
if self.cancel_handle.is_cancelled() {
return RequestCancelResult::AlreadyCancelling;
}
self.cancel_handle.cancel();
RequestCancelResult::CancelIssued
}
}
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub(crate) enum RequestCancelResult {
CancelIssued,
AlreadyCancelling,
TooLateToCancel,
NotRunning,
}
impl Drop for CompactionScheduler {
@@ -703,6 +808,8 @@ struct CompactionStatus {
pending_request: Option<PendingCompaction>,
/// Pending DDL requests that should run when compaction is done.
pending_ddl_requests: Vec<SenderDdlRequest>,
/// Active compaction state.
active_compaction: Option<ActiveCompaction>,
}
impl CompactionStatus {
@@ -719,9 +826,39 @@ impl CompactionStatus {
waiters: Vec::new(),
pending_request: None,
pending_ddl_requests: Vec::new(),
active_compaction: None,
}
}
#[cfg(test)]
fn start_local_task(&mut self) -> LocalCompactionState {
let state = LocalCompactionState::new(Arc::new(CancellationHandle::default()));
self.active_compaction = Some(ActiveCompaction::Local {
state: state.clone(),
});
state
}
#[cfg(test)]
fn start_remote_task(&mut self) {
self.active_compaction = Some(ActiveCompaction::Remote);
}
fn request_cancel(&mut self) -> RequestCancelResult {
let Some(active_compaction) = &self.active_compaction else {
return RequestCancelResult::NotRunning;
};
match active_compaction {
ActiveCompaction::Local { state, .. } => state.request_cancel(),
ActiveCompaction::Remote => RequestCancelResult::TooLateToCancel,
}
}
fn clear_running_task(&mut self) -> bool {
self.active_compaction.take().is_some()
}
/// Merge the waiter to the pending compaction.
fn merge_waiter(&mut self, mut waiter: OptionOutputTx) {
if let Some(waiter) = waiter.take_inner() {
@@ -764,6 +901,23 @@ impl CompactionStatus {
}
}
#[must_use]
fn on_cancel(mut self) -> Vec<SenderDdlRequest> {
for waiter in self.waiters.drain(..) {
waiter.send(CompactionCancelledSnafu.fail());
}
if let Some(pending_compaction) = self.pending_request {
pending_compaction.waiter.send(
Err(Arc::new(CompactionCancelledSnafu.build())).context(CompactRegionSnafu {
region_id: self.region_id,
}),
);
}
std::mem::take(&mut self.pending_ddl_requests)
}
/// Creates a new compaction request for compaction picker.
///
/// It consumes all pending compaction waiters.
@@ -1362,6 +1516,58 @@ mod tests {
);
}
#[tokio::test]
async fn test_schedule_compaction_does_not_publish_status_when_schedule_fails() {
common_telemetry::init_default_ut_logging();
let env = SchedulerEnv::new()
.await
.scheduler(Arc::new(FailingScheduler));
let (tx, _rx) = mpsc::channel(4);
let mut scheduler = env.mock_compaction_scheduler(tx);
let mut builder = VersionControlBuilder::new();
let end = 1000 * 1000;
let version_control = Arc::new(
builder
.push_l0_file(0, end)
.push_l0_file(10, end)
.push_l0_file(50, end)
.push_l0_file(80, end)
.push_l0_file(90, end)
.build(),
);
let region_id = builder.region_id();
let manifest_ctx = env
.mock_manifest_context(version_control.current().version.metadata.clone())
.await;
let (schema_metadata_manager, kv_backend) = mock_schema_metadata_manager();
schema_metadata_manager
.register_region_table_info(
builder.region_id().table_id(),
"test_table",
"test_catalog",
"test_schema",
None,
kv_backend,
)
.await;
let result = scheduler
.schedule_compaction(
region_id,
compact_request::Options::Regular(Default::default()),
&version_control,
&env.access_layer,
OptionOutputTx::none(),
&manifest_ctx,
schema_metadata_manager,
1,
)
.await;
assert!(result.is_err());
assert!(!scheduler.region_status.contains_key(&region_id));
}
#[tokio::test]
async fn test_manual_compaction_when_compaction_in_progress() {
common_telemetry::init_default_ut_logging();
@@ -1542,6 +1748,11 @@ mod tests {
region_id,
CompactionStatus::new(region_id, version_control, env.access_layer.clone()),
);
scheduler
.region_status
.get_mut(&region_id)
.unwrap()
.start_local_task();
let (output_tx, _output_rx) = oneshot::channel();
scheduler.add_ddl_request_to_pending(SenderDdlRequest {
@@ -1558,6 +1769,142 @@ mod tests {
assert!(scheduler.has_pending_ddls(region_id));
}
#[tokio::test]
async fn test_request_cancel_state_transitions() {
let env = SchedulerEnv::new().await;
let builder = VersionControlBuilder::new();
let region_id = builder.region_id();
let version_control = Arc::new(builder.build());
let mut status =
CompactionStatus::new(region_id, version_control, env.access_layer.clone());
let state = status.start_local_task();
assert_eq!(status.request_cancel(), RequestCancelResult::CancelIssued);
assert!(state.cancel_handle().is_cancelled());
assert_eq!(
status.request_cancel(),
RequestCancelResult::AlreadyCancelling
);
assert!(!state.mark_commit_started());
assert_eq!(
status.request_cancel(),
RequestCancelResult::AlreadyCancelling
);
assert!(status.clear_running_task());
assert_eq!(status.request_cancel(), RequestCancelResult::NotRunning);
}
#[tokio::test]
async fn test_request_cancel_remote_compaction_is_too_late() {
let env = SchedulerEnv::new().await;
let builder = VersionControlBuilder::new();
let region_id = builder.region_id();
let version_control = Arc::new(builder.build());
let mut status =
CompactionStatus::new(region_id, version_control, env.access_layer.clone());
status.start_remote_task();
assert_eq!(
status.request_cancel(),
RequestCancelResult::TooLateToCancel
);
assert!(status.active_compaction.is_some());
}
#[tokio::test]
async fn test_on_compaction_cancelled_returns_pending_ddl_requests() {
let job_scheduler = Arc::new(VecScheduler::default());
let env = SchedulerEnv::new().await.scheduler(job_scheduler.clone());
let (tx, _rx) = mpsc::channel(4);
let mut scheduler = env.mock_compaction_scheduler(tx);
let builder = VersionControlBuilder::new();
let version_control = Arc::new(builder.build());
let region_id = builder.region_id();
let _manifest_ctx = env
.mock_manifest_context(version_control.current().version.metadata.clone())
.await;
let (_schema_metadata_manager, _kv_backend) = mock_schema_metadata_manager();
scheduler.region_status.insert(
region_id,
CompactionStatus::new(region_id, version_control, env.access_layer.clone()),
);
scheduler
.region_status
.get_mut(&region_id)
.unwrap()
.start_local_task();
let (output_tx, _output_rx) = oneshot::channel();
scheduler.add_ddl_request_to_pending(SenderDdlRequest {
region_id,
sender: OptionOutputTx::from(output_tx),
request: crate::request::DdlRequest::EnterStaging(
store_api::region_request::EnterStagingRequest {
partition_directive:
store_api::region_request::StagingPartitionDirective::RejectAllWrites,
},
),
});
let pending_ddls = scheduler.on_compaction_cancelled(region_id).await;
assert_eq!(pending_ddls.len(), 1);
assert!(!scheduler.has_pending_ddls(region_id));
assert!(!scheduler.region_status.contains_key(&region_id));
assert_eq!(job_scheduler.num_jobs(), 0);
}
#[tokio::test]
async fn test_on_compaction_cancelled_prioritizes_pending_ddls_over_pending_compaction() {
let job_scheduler = Arc::new(VecScheduler::default());
let env = SchedulerEnv::new().await.scheduler(job_scheduler.clone());
let (tx, _rx) = mpsc::channel(4);
let mut scheduler = env.mock_compaction_scheduler(tx);
let builder = VersionControlBuilder::new();
let version_control = Arc::new(builder.build());
let region_id = builder.region_id();
let _manifest_ctx = env
.mock_manifest_context(version_control.current().version.metadata.clone())
.await;
let (_schema_metadata_manager, _kv_backend) = mock_schema_metadata_manager();
scheduler.region_status.insert(
region_id,
CompactionStatus::new(region_id, version_control, env.access_layer.clone()),
);
let status = scheduler.region_status.get_mut(&region_id).unwrap();
status.start_local_task();
let (manual_tx, manual_rx) = oneshot::channel();
status.set_pending_request(PendingCompaction {
options: compact_request::Options::StrictWindow(StrictWindow { window_seconds: 60 }),
waiter: OptionOutputTx::from(manual_tx),
max_parallelism: 1,
});
let (output_tx, _output_rx) = oneshot::channel();
scheduler.add_ddl_request_to_pending(SenderDdlRequest {
region_id,
sender: OptionOutputTx::from(output_tx),
request: crate::request::DdlRequest::EnterStaging(
store_api::region_request::EnterStagingRequest {
partition_directive:
store_api::region_request::StagingPartitionDirective::RejectAllWrites,
},
),
});
let pending_ddls = scheduler.on_compaction_cancelled(region_id).await;
assert_eq!(pending_ddls.len(), 1);
assert!(!scheduler.region_status.contains_key(&region_id));
assert_eq!(job_scheduler.num_jobs(), 0);
assert_matches!(manual_rx.await.unwrap(), Err(_));
}
#[tokio::test]
async fn test_pending_ddl_request_failed_on_compaction_failed() {
let env = SchedulerEnv::new().await;
@@ -1713,6 +2060,11 @@ mod tests {
region_id,
CompactionStatus::new(region_id, version_control, env.access_layer.clone()),
);
scheduler
.region_status
.get_mut(&region_id)
.unwrap()
.start_local_task();
let (output_tx, _output_rx) = oneshot::channel();
scheduler.add_ddl_request_to_pending(SenderDdlRequest {
@@ -1752,6 +2104,7 @@ mod tests {
let (manual_tx, manual_rx) = oneshot::channel();
let mut status =
CompactionStatus::new(region_id, version_control.clone(), env.access_layer.clone());
status.start_local_task();
status.set_pending_request(PendingCompaction {
options: compact_request::Options::Regular(Default::default()),
waiter: OptionOutputTx::from(manual_tx),
@@ -1827,6 +2180,7 @@ mod tests {
let (manual_tx, manual_rx) = oneshot::channel();
let mut status =
CompactionStatus::new(region_id, version_control.clone(), env.access_layer.clone());
status.start_local_task();
status.set_pending_request(PendingCompaction {
options: compact_request::Options::Regular(Default::default()),
waiter: OptionOutputTx::from(manual_tx),
@@ -1873,6 +2227,11 @@ mod tests {
region_id,
CompactionStatus::new(region_id, version_control, env.access_layer.clone()),
);
scheduler
.region_status
.get_mut(&region_id)
.unwrap()
.start_local_task();
let pending_ddls = scheduler
.on_compaction_finished(region_id, &manifest_ctx, schema_metadata_manager)
@@ -1910,6 +2269,11 @@ mod tests {
region_id,
CompactionStatus::new(region_id, version_control, env.access_layer.clone()),
);
scheduler
.region_status
.get_mut(&region_id)
.unwrap()
.start_local_task();
let pending_ddls = scheduler
.on_compaction_finished(region_id, &manifest_ctx, schema_metadata_manager)
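
The `LocalCompactionState` added in this file serializes `mark_commit_started` and `request_cancel` behind one mutex, so a task either observes the cancellation before committing or the canceller is told it is too late. The following is a minimal standalone sketch of that handshake. `TaskState` and `CancelOutcome` are hypothetical names, and an `AtomicBool` stands in for the real `CancellationHandle`, whose API is not reproduced here.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

/// Outcome of a cancellation request (mirrors three variants from the diff).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CancelOutcome {
    CancelIssued,
    AlreadyCancelling,
    TooLateToCancel,
}

/// Hypothetical stand-in for `LocalCompactionState`.
#[derive(Clone, Default)]
struct TaskState {
    // Assumption: a plain flag replaces the real `CancellationHandle`.
    cancelled: Arc<AtomicBool>,
    commit_started: Arc<Mutex<bool>>,
}

impl TaskState {
    /// Called by the task right before it commits.
    /// Returns false if cancellation was already requested.
    fn mark_commit_started(&self) -> bool {
        let mut started = self.commit_started.lock().unwrap();
        if self.cancelled.load(Ordering::SeqCst) {
            return false;
        }
        *started = true;
        true
    }

    /// Called by whoever wants to cancel the task.
    fn request_cancel(&self) -> CancelOutcome {
        // Holding the lock means the flag cannot flip between the
        // `commit_started` check and the cancellation itself.
        let started = self.commit_started.lock().unwrap();
        if *started {
            return CancelOutcome::TooLateToCancel;
        }
        if self.cancelled.swap(true, Ordering::SeqCst) {
            CancelOutcome::AlreadyCancelling
        } else {
            CancelOutcome::CancelIssued
        }
    }
}

fn main() {
    // Cancellation arrives first: the commit is refused.
    let task = TaskState::default();
    assert_eq!(task.request_cancel(), CancelOutcome::CancelIssued);
    assert_eq!(task.request_cancel(), CancelOutcome::AlreadyCancelling);
    assert!(!task.mark_commit_started());

    // Commit starts first: cancellation is rejected.
    let task = TaskState::default();
    assert!(task.mark_commit_started());
    assert_eq!(task.request_cancel(), CancelOutcome::TooLateToCancel);
    println!("ok");
}
```

Because both methods take the same lock, exactly one of them wins for a given task, which matches the transitions checked in `test_request_cancel_state_transitions`.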

@@ -17,8 +17,9 @@ use std::sync::Arc;
use std::time::Duration;
use api::v1::region::compact_request;
use common_base::cancellation::{CancellableFuture, CancellationHandle};
use common_meta::key::SchemaMetadataManagerRef;
use common_telemetry::{info, warn};
use common_telemetry::{debug, info, warn};
use common_time::TimeToLive;
use either::Either;
use itertools::Itertools;
@@ -38,11 +39,10 @@ use crate::compaction::picker::{PickerOutput, new_picker};
use crate::compaction::{CompactionOutput, CompactionSstReaderBuilder, find_dynamic_options};
use crate::config::MitoConfig;
use crate::error::{
EmptyRegionDirSnafu, InvalidPartitionExprSnafu, JoinSnafu, ObjectStoreNotFoundSnafu, Result,
EmptyRegionDirSnafu, InvalidPartitionExprSnafu, ObjectStoreNotFoundSnafu, Result,
};
use crate::manifest::action::{RegionEdit, RegionMetaAction, RegionMetaActionList};
use crate::manifest::manager::{RegionManifestManager, RegionManifestOptions};
use crate::metrics;
use crate::read::FlatSource;
use crate::region::options::RegionOptions;
use crate::region::version::VersionRef;
@@ -56,6 +56,7 @@ use crate::sst::index::puffin_manager::PuffinManagerFactory;
use crate::sst::location::region_dir_from_table_dir;
use crate::sst::parquet::WriteOptions;
use crate::sst::version::{SstVersion, SstVersionRef};
use crate::{error, metrics};
/// Region version for compaction that does not hold memtables.
#[derive(Clone)]
@@ -299,12 +300,28 @@ pub trait Compactor: Send + Sync + 'static {
) -> Result<()>;
}
/// DefaultCompactor is the default implementation of Compactor.
pub struct DefaultCompactor;
impl DefaultCompactor {
/// Merge a single compaction output into SST files.
/// Trait for merging a single compaction output into SST files.
///
/// This is extracted from `DefaultCompactor` to allow injecting mock
/// implementations in tests.
#[async_trait::async_trait]
pub trait SstMerger: Send + Sync + 'static {
async fn merge_single_output(
&self,
compaction_region: CompactionRegion,
output: CompactionOutput,
write_opts: WriteOptions,
) -> Result<Vec<FileMeta>>;
}
/// The production [`SstMerger`] that reads, merges, and writes SST files.
#[derive(Clone)]
pub struct DefaultSstMerger;
#[async_trait::async_trait]
impl SstMerger for DefaultSstMerger {
async fn merge_single_output(
&self,
compaction_region: CompactionRegion,
output: CompactionOutput,
write_opts: WriteOptions,
@@ -424,54 +441,145 @@ impl DefaultCompactor {
}
}
/// DefaultCompactor is the default implementation of Compactor.
///
/// It is parameterized by an [`SstMerger`] to allow injecting mock
/// implementations in tests.
pub struct DefaultCompactor<M = DefaultSstMerger> {
merger: M,
cancel_handle: Arc<CancellationHandle>,
}
#[cfg(test)]
impl<M: SstMerger> DefaultCompactor<M> {
pub fn with_merger(merger: M) -> Self {
Self {
merger,
cancel_handle: Arc::new(CancellationHandle::default()),
}
}
}
impl DefaultCompactor {
pub fn with_cancel_handle(cancel_handle: Arc<CancellationHandle>) -> Self {
Self {
merger: DefaultSstMerger,
cancel_handle,
}
}
}
#[async_trait::async_trait]
impl Compactor for DefaultCompactor {
impl<M: SstMerger> Compactor for DefaultCompactor<M>
where
M: Clone,
{
async fn merge_ssts(
&self,
compaction_region: &CompactionRegion,
mut picker_output: PickerOutput,
) -> Result<MergeOutput> {
let mut futs = Vec::with_capacity(picker_output.outputs.len());
let mut compacted_inputs =
Vec::with_capacity(picker_output.outputs.iter().map(|o| o.inputs.len()).sum());
let internal_parallelism = compaction_region.max_parallelism.max(1);
let compaction_time_window = picker_output.time_window_size;
let region_id = compaction_region.region_id;
// Build tasks along with their input file metas so we can track which
// inputs correspond to each task.
let mut tasks: Vec<(Vec<FileMeta>, _)> = Vec::with_capacity(picker_output.outputs.len());
for output in picker_output.outputs.drain(..) {
let inputs_to_remove: Vec<_> =
output.inputs.iter().map(|f| f.meta_ref().clone()).collect();
compacted_inputs.extend(inputs_to_remove.iter().cloned());
let write_opts = WriteOptions {
write_buffer_size: compaction_region.engine_config.sst_write_buffer_size,
max_file_size: picker_output.max_file_size,
..Default::default()
};
futs.push(Self::merge_single_output(
compaction_region.clone(),
output,
write_opts,
));
}
let mut output_files = Vec::with_capacity(futs.len());
while !futs.is_empty() {
let mut task_chunk = Vec::with_capacity(internal_parallelism);
for _ in 0..internal_parallelism {
if let Some(task) = futs.pop() {
task_chunk.push(common_runtime::spawn_compact(task));
}
}
let metas = futures::future::try_join_all(task_chunk)
.await
.context(JoinSnafu)?
.into_iter()
.collect::<Result<Vec<Vec<_>>>>()?;
output_files.extend(metas.into_iter().flatten());
let merger = self.merger.clone();
let compaction_region = compaction_region.clone();
let fut = async move {
merger
.merge_single_output(compaction_region, output, write_opts)
.await
};
tasks.push((inputs_to_remove, fut));
}
// In case of remote compaction, we still allow the region edit after merge to
// clean expired ssts.
let mut inputs: Vec<_> = compacted_inputs.into_iter().collect();
inputs.extend(
let mut output_files = Vec::with_capacity(tasks.len());
let mut compacted_inputs = Vec::with_capacity(
tasks.iter().map(|(inputs, _)| inputs.len()).sum::<usize>()
+ picker_output.expired_ssts.len(),
);
while !tasks.is_empty() {
let mut chunk: Vec<(Vec<FileMeta>, _)> = Vec::with_capacity(internal_parallelism);
for _ in 0..internal_parallelism {
if let Some(task) = tasks.pop() {
chunk.push(task);
}
}
let mut spawned: Vec<_> = chunk
.into_iter()
.map(|(inputs, fut)| {
let handle = common_runtime::spawn_compact(fut);
(inputs, handle)
})
.collect();
while let Some((inputs, handle)) = spawned.pop() {
let abort_handle = handle.abort_handle();
match CancellableFuture::new(handle, self.cancel_handle.clone()).await {
Ok(Ok(Ok(files))) => {
output_files.extend(files);
compacted_inputs.extend(inputs);
}
Ok(Ok(Err(e))) => {
warn!(
e; "Failed to merge compaction output for region: {}, inputs: [{}]",
region_id,
inputs.iter().map(|f| f.file_id.to_string()).join(",")
);
}
Ok(Err(e)) => {
warn!(
"Region {} compaction task join error for inputs: [{}], skipping: {}",
region_id,
inputs.iter().map(|f| f.file_id.to_string()).join(","),
e
);
// If the cancel handle is cancelled,
// abort the remaining tasks before returning the error.
if self.cancel_handle.is_cancelled() {
abort_handle.abort();
for (_, handle) in spawned {
handle.abort();
}
}
return Err(e).context(error::JoinSnafu);
}
Err(_) => {
debug!(
"Compaction merge cancelled for region: {}, aborting remaining {} spawned tasks",
region_id,
spawned.len(),
);
abort_handle.abort();
for (_, handle) in spawned {
handle.abort();
}
break;
}
}
}
if self.cancel_handle.is_cancelled() {
info!("Compaction merge cancelled for region: {}", region_id);
break;
}
}
// Include expired SSTs in removals — these don't depend on merge success.
compacted_inputs.extend(
picker_output
.expired_ssts
.iter()
@@ -480,7 +588,7 @@ impl Compactor for DefaultCompactor {
Ok(MergeOutput {
files_to_add: output_files,
files_to_remove: inputs,
files_to_remove: compacted_inputs,
compaction_time_window: Some(compaction_time_window),
})
}
@@ -558,3 +666,328 @@ impl Compactor for DefaultCompactor {
Ok(())
}
}
#[cfg(test)]
mod tests {
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::time::Duration;
use store_api::storage::{FileId, RegionId};
use tokio::time::sleep;
use super::{DefaultCompactor, *};
use crate::cache::CacheManager;
use crate::compaction::picker::PickerOutput;
use crate::error::Result;
use crate::sst::file::FileHandle;
use crate::sst::file_purger::NoopFilePurger;
use crate::sst::version::SstVersion;
use crate::test_util::memtable_util::metadata_for_test;
use crate::test_util::scheduler_util::SchedulerEnv;
fn dummy_file_meta() -> FileMeta {
FileMeta {
region_id: RegionId::new(1, 1),
file_id: FileId::random(),
file_size: 100,
..Default::default()
}
}
fn new_file_handle(meta: FileMeta) -> FileHandle {
FileHandle::new(meta, Arc::new(NoopFilePurger))
}
/// Build a minimal [`CompactionRegion`] suitable for tests where the
/// [`SstMerger`] is mocked and never touches the access layer.
async fn new_test_compaction_region() -> CompactionRegion {
let env = SchedulerEnv::new().await;
let metadata = metadata_for_test();
let manifest_ctx = env.mock_manifest_context(metadata.clone()).await;
CompactionRegion {
region_id: RegionId::new(1, 1),
region_options: RegionOptions::default(),
engine_config: Arc::new(MitoConfig::default()),
region_metadata: metadata.clone(),
cache_manager: Arc::new(CacheManager::default()),
access_layer: env.access_layer.clone(),
manifest_ctx,
current_version: CompactionVersion {
metadata,
options: RegionOptions::default(),
ssts: Arc::new(SstVersion::new()),
compaction_time_window: None,
},
file_purger: None,
ttl: None,
max_parallelism: 1,
}
}
/// An [`SstMerger`] that returns pre-configured results per call index.
///
/// Call 0 gets `results[0]`, call 1 gets `results[1]`, etc.
#[derive(Clone)]
struct MockMerger {
results: Arc<Mutex<Vec<Result<Vec<FileMeta>>>>>,
call_idx: Arc<AtomicUsize>,
}
impl MockMerger {
fn new(results: Vec<Result<Vec<FileMeta>>>) -> Self {
Self {
results: Arc::new(Mutex::new(results)),
call_idx: Arc::new(AtomicUsize::new(0)),
}
}
}
#[async_trait::async_trait]
impl SstMerger for MockMerger {
async fn merge_single_output(
&self,
_compaction_region: CompactionRegion,
_output: CompactionOutput,
_write_opts: WriteOptions,
) -> Result<Vec<FileMeta>> {
let idx = self.call_idx.fetch_add(1, Ordering::SeqCst);
match self.results.lock().unwrap().get(idx) {
Some(Ok(files)) => Ok(files.clone()),
Some(Err(_)) => error::InvalidMetaSnafu {
reason: format!("simulated failure at index {idx}"),
}
.fail(),
None => panic!("MockMerger: no result configured for call index {idx}"),
}
}
}
#[tokio::test]
async fn test_partial_merge_failure_collects_only_successful_outputs() {
common_telemetry::init_default_ut_logging();
let compaction_region = new_test_compaction_region().await;
// Prepare 3 compaction outputs: output 0 and 2 succeed, output 1 fails.
let input_meta_0 = dummy_file_meta();
let input_meta_1 = dummy_file_meta();
let input_meta_2 = dummy_file_meta();
let output_meta_0 = vec![dummy_file_meta()];
let output_meta_2 = vec![dummy_file_meta(), dummy_file_meta()];
let merger = MockMerger::new(vec![
Ok(output_meta_0.clone()),
Err(error::InvalidMetaSnafu {
reason: "boom".to_string(),
}
.build()),
Ok(output_meta_2.clone()),
]);
let compactor = DefaultCompactor::with_merger(merger);
let picker_output = PickerOutput {
outputs: vec![
CompactionOutput {
output_level: 1,
inputs: vec![new_file_handle(input_meta_0.clone())],
filter_deleted: false,
output_time_range: None,
},
CompactionOutput {
output_level: 1,
inputs: vec![new_file_handle(input_meta_1.clone())],
filter_deleted: false,
output_time_range: None,
},
CompactionOutput {
output_level: 1,
inputs: vec![new_file_handle(input_meta_2.clone())],
filter_deleted: false,
output_time_range: None,
},
],
expired_ssts: vec![],
time_window_size: 3600,
max_file_size: None,
};
let merge_output = compactor
.merge_ssts(&compaction_region, picker_output)
.await
.unwrap();
// Outputs 0 and 2 succeeded (1 + 2 = 3 files added).
assert_eq!(merge_output.files_to_add.len(), 3);
// Only inputs from successful merges should be removed.
assert_eq!(merge_output.files_to_remove.len(), 2);
let removed_ids: Vec<_> = merge_output
.files_to_remove
.iter()
.map(|f| f.file_id)
.collect();
assert!(removed_ids.contains(&input_meta_0.file_id));
assert!(removed_ids.contains(&input_meta_2.file_id));
// The failed output's input must NOT be removed.
assert!(!removed_ids.contains(&input_meta_1.file_id));
}
#[tokio::test]
async fn test_all_outputs_succeed() {
common_telemetry::init_default_ut_logging();
let compaction_region = new_test_compaction_region().await;
let input_meta = dummy_file_meta();
let output_meta = vec![dummy_file_meta()];
let merger = MockMerger::new(vec![Ok(output_meta.clone())]);
let compactor = DefaultCompactor::with_merger(merger);
let picker_output = PickerOutput {
outputs: vec![CompactionOutput {
output_level: 1,
inputs: vec![new_file_handle(input_meta.clone())],
filter_deleted: false,
output_time_range: None,
}],
expired_ssts: vec![],
time_window_size: 3600,
max_file_size: None,
};
let merge_output = compactor
.merge_ssts(&compaction_region, picker_output)
.await
.unwrap();
assert_eq!(merge_output.files_to_add.len(), 1);
assert_eq!(merge_output.files_to_add[0].file_id, output_meta[0].file_id);
assert_eq!(merge_output.files_to_remove.len(), 1);
assert_eq!(merge_output.files_to_remove[0].file_id, input_meta.file_id);
}
#[tokio::test]
async fn test_expired_ssts_always_removed() {
common_telemetry::init_default_ut_logging();
let compaction_region = new_test_compaction_region().await;
let input_meta = dummy_file_meta();
let expired_meta = dummy_file_meta();
// The single merge output fails, but expired SSTs should still be removed.
let merger = MockMerger::new(vec![Err(error::InvalidMetaSnafu {
reason: "fail".to_string(),
}
.build())]);
let compactor = DefaultCompactor::with_merger(merger);
let picker_output = PickerOutput {
outputs: vec![CompactionOutput {
output_level: 1,
inputs: vec![new_file_handle(input_meta.clone())],
filter_deleted: false,
output_time_range: None,
}],
expired_ssts: vec![new_file_handle(expired_meta.clone())],
time_window_size: 3600,
max_file_size: None,
};
let merge_output = compactor
.merge_ssts(&compaction_region, picker_output)
.await
.unwrap();
// No files added (merge failed).
assert!(merge_output.files_to_add.is_empty());
// Only the expired SST should be in files_to_remove (not the failed merge's input).
assert_eq!(merge_output.files_to_remove.len(), 1);
assert_eq!(
merge_output.files_to_remove[0].file_id,
expired_meta.file_id
);
}
#[derive(Clone)]
struct BlockingMerger {
call_idx: Arc<AtomicUsize>,
}
#[async_trait::async_trait]
impl SstMerger for BlockingMerger {
async fn merge_single_output(
&self,
_compaction_region: CompactionRegion,
_output: CompactionOutput,
_write_opts: WriteOptions,
) -> Result<Vec<FileMeta>> {
self.call_idx.fetch_add(1, Ordering::SeqCst);
std::future::pending().await
}
}
#[tokio::test(flavor = "multi_thread")]
async fn test_merge_ssts_cancels_spawned_tasks() {
common_telemetry::init_default_ut_logging();
let mut compaction_region = new_test_compaction_region().await;
compaction_region.max_parallelism = 2;
let cancel_handle = Arc::new(CancellationHandle::default());
let call_idx = Arc::new(AtomicUsize::new(0));
let compactor = DefaultCompactor {
merger: BlockingMerger {
call_idx: call_idx.clone(),
},
cancel_handle: cancel_handle.clone(),
};
let picker_output = PickerOutput {
outputs: vec![
CompactionOutput {
output_level: 1,
inputs: vec![new_file_handle(dummy_file_meta())],
filter_deleted: false,
output_time_range: None,
},
CompactionOutput {
output_level: 1,
inputs: vec![new_file_handle(dummy_file_meta())],
filter_deleted: false,
output_time_range: None,
},
CompactionOutput {
output_level: 1,
inputs: vec![new_file_handle(dummy_file_meta())],
filter_deleted: false,
output_time_range: None,
},
],
expired_ssts: vec![],
time_window_size: 3600,
max_file_size: None,
};
let task = tokio::spawn(async move {
compactor
.merge_ssts(&compaction_region, picker_output)
.await
});
sleep(Duration::from_millis(100)).await;
cancel_handle.cancel();
let merge_output = task
.await
.expect("merge_ssts should stop after cancellation")
.unwrap();
let started = call_idx.load(Ordering::SeqCst);
assert!(merge_output.files_to_add.is_empty());
assert!(merge_output.files_to_remove.is_empty());
assert_eq!(started, 2);
}
}
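The tests above pin down one bookkeeping rule in `merge_ssts`: the inputs of an output are scheduled for removal only when that output's merge succeeded, while expired SSTs are removed regardless. A minimal std-only sketch of that rule follows. `MergeAttempt`, `collect_merge_output`, and the `u32` file ids are hypothetical stand-ins for illustration, not the crate's actual types.

```rust
/// Hypothetical stand-in for one compaction output: its input file ids
/// and the result of merging them (new file ids, or an error message).
struct MergeAttempt {
    inputs: Vec<u32>,
    result: Result<Vec<u32>, String>,
}

/// Returns `(files_to_add, files_to_remove)` for a batch of merge attempts.
fn collect_merge_output(attempts: Vec<MergeAttempt>, expired: Vec<u32>) -> (Vec<u32>, Vec<u32>) {
    let mut files_to_add = Vec::new();
    let mut files_to_remove = Vec::new();
    for attempt in attempts {
        // A failed merge contributes nothing: its inputs stay in the region.
        if let Ok(outputs) = attempt.result {
            files_to_add.extend(outputs);
            files_to_remove.extend(attempt.inputs);
        }
    }
    // Expired files are removed even if every merge failed.
    files_to_remove.extend(expired);
    (files_to_add, files_to_remove)
}
```

Removing the inputs of a failed merge would drop data that has no replacement file, which is why the failed output's input must stay out of `files_to_remove`.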


@@ -16,13 +16,15 @@ use std::fmt::{Debug, Formatter};
use std::sync::Arc;
use std::time::Instant;
use common_base::cancellation::CancellableFuture;
use common_memory_manager::OnExhaustedPolicy;
use common_telemetry::{error, info, warn};
use itertools::Itertools;
use snafu::ResultExt;
use tokio::sync::mpsc;
use crate::compaction::compactor::{CompactionRegion, Compactor};
use crate::compaction::LocalCompactionState;
use crate::compaction::compactor::{CompactionRegion, Compactor, MergeOutput};
use crate::compaction::memory_manager::{CompactionMemoryGuard, CompactionMemoryManager};
use crate::compaction::picker::{CompactionTask, PickerOutput};
use crate::error::{CompactRegionSnafu, CompactionMemoryExhaustedSnafu};
@@ -30,8 +32,8 @@ use crate::manifest::action::{RegionEdit, RegionMetaAction, RegionMetaActionList
use crate::metrics::{COMPACTION_FAILURE_COUNT, COMPACTION_MEMORY_WAIT, COMPACTION_STAGE_ELAPSED};
use crate::region::RegionRoleState;
use crate::request::{
BackgroundNotify, CompactionFailed, CompactionFinished, OutputTx, RegionEditResult,
WorkerRequest, WorkerRequestWithTime,
BackgroundNotify, CompactionCancelled, CompactionFailed, CompactionFinished, OutputTx,
RegionEditResult, WorkerRequest, WorkerRequestWithTime,
};
use crate::sst::file::FileMeta;
use crate::worker::WorkerListener;
@@ -41,6 +43,8 @@ use crate::{error, metrics};
pub const MAX_PARALLEL_COMPACTION: usize = 1;
pub(crate) struct CompactionTaskImpl {
/// Shared local-compaction state for cooperative cancellation.
pub(crate) state: LocalCompactionState,
pub compaction_region: CompactionRegion,
/// Request sender to notify the worker.
pub(crate) request_sender: mpsc::Sender<WorkerRequestWithTime>,
@@ -184,9 +188,7 @@ impl CompactionTaskImpl {
);
}
async fn handle_expiration_and_compaction(&mut self) -> error::Result<RegionEdit> {
self.mark_files_compacting(true);
async fn handle_expiration(&mut self) {
// 1. In case of local compaction, we can delete expired ssts in advance.
if !self.picker_output.expired_ssts.is_empty() {
let remove_timer = COMPACTION_STAGE_ELAPSED
@@ -203,7 +205,9 @@ impl CompactionTaskImpl {
.await;
remove_timer.observe_duration();
}
}
async fn handle_compaction(&mut self) -> error::Result<MergeOutput> {
// 2. Merge inputs
let merge_timer = COMPACTION_STAGE_ELAPSED
.with_label_values(&["merge"])
@@ -239,6 +243,13 @@ impl CompactionTaskImpl {
.on_merge_ssts_finished(self.compaction_region.region_id)
.await;
Ok(compaction_result)
}
async fn update_manifest(
&self,
compaction_result: crate::compaction::compactor::MergeOutput,
) -> error::Result<RegionEdit> {
let _manifest_timer = COMPACTION_STAGE_ELAPSED
.with_label_values(&["write_manifest"])
.start_timer();
@@ -296,14 +307,61 @@ impl CompactionTask for CompactionTaskImpl {
}
};
let notify = match self.handle_expiration_and_compaction().await {
Ok(edit) => BackgroundNotify::CompactionFinished(CompactionFinished {
region_id: self.compaction_region.region_id,
senders: std::mem::take(&mut self.waiters),
start_time: self.start_time,
edit,
}),
Err(e) => {
// Mark files as compacting before compaction and unmark them afterwards (even if compaction is cancelled or fails), so that they won't be picked by other compaction tasks.
self.mark_files_compacting(true);
self.handle_expiration().await;
let cancel_handle = self.state.cancel_handle();
// Run compaction with cooperative cancellation.
let notify = match CancellableFuture::new(
async { self.handle_compaction().await },
cancel_handle,
)
.await
{
Ok(Ok(merge_output)) => {
// Stop accepting cancellation once we are about to publish the compaction edit.
if !self.state.mark_commit_started() {
let senders = std::mem::take(&mut self.waiters);
BackgroundNotify::CompactionCancelled(CompactionCancelled {
region_id: self.compaction_region.region_id,
senders,
})
} else {
match self.update_manifest(merge_output).await {
Ok(edit) => {
let senders = std::mem::take(&mut self.waiters);
BackgroundNotify::CompactionFinished(CompactionFinished {
region_id: self.compaction_region.region_id,
senders,
start_time: self.start_time,
edit,
})
}
Err(e) => {
error!(e; "Failed to compact region, region id: {}", self.compaction_region.region_id);
let err = Arc::new(e);
self.on_failure(err.clone());
BackgroundNotify::CompactionFailed(CompactionFailed {
region_id: self.compaction_region.region_id,
err,
})
}
}
}
}
Err(_) => {
info!(
"Compaction cancelled, region id: {}",
self.compaction_region.region_id
);
let senders = std::mem::take(&mut self.waiters);
BackgroundNotify::CompactionCancelled(CompactionCancelled {
region_id: self.compaction_region.region_id,
senders,
})
}
Ok(Err(e)) => {
error!(e; "Failed to compact region, region id: {}", self.compaction_region.region_id);
let err = Arc::new(e);
// notify compaction waiters
@@ -334,7 +392,7 @@ mod tests {
fn test_picker_output_with_expired_ssts() {
// Test that PickerOutput correctly includes expired_ssts
// This verifies that expired SSTs are properly identified and included
// in the picker output, which is then handled by handle_expiration_and_compaction
// in the picker output, which is then handled by handle_expiration()
let file_ids = (0..3).map(|_| FileId::random()).collect::<Vec<_>>();
let expired_ssts = vec![
@@ -382,6 +440,6 @@ mod tests {
//
// The behavior is tested indirectly through integration tests:
// - remove_expired() logs errors but doesn't stop compaction
// - handle_expiration_and_compaction() continues even if remove_expired() encounters errors
// - The function is designed to be non-blocking for compaction
// - handle_expiration() continues even if remove_expired() encounters errors
// - The expiration stage is designed to be non-blocking for compaction
}
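The hunk above stops accepting cancellation once `mark_commit_started()` returns `true`, and reports `CompactionCancelled` when it returns `false`. The implementation of `LocalCompactionState` is not part of this diff, so the following is only a sketch of one way such a gate could work, assuming a single atomic state. `CommitGate` and its constants are hypothetical names.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const RUNNING: u8 = 0;
const CANCELLED: u8 = 1;
const COMMITTING: u8 = 2;

/// Hypothetical model of a shared compaction state with a commit gate.
struct CommitGate {
    state: AtomicU8,
}

impl CommitGate {
    fn new() -> Self {
        Self { state: AtomicU8::new(RUNNING) }
    }

    /// Requests cancellation. Returns `false` once the commit has started.
    fn cancel(&self) -> bool {
        match self
            .state
            .compare_exchange(RUNNING, CANCELLED, Ordering::SeqCst, Ordering::SeqCst)
        {
            Ok(_) => true,
            // Already cancelled counts as success; a started commit does not.
            Err(current) => current == CANCELLED,
        }
    }

    /// Claims the right to publish the edit. Returns `false` if the state
    /// is no longer `RUNNING`, for example because cancellation won the race.
    fn mark_commit_started(&self) -> bool {
        self.state
            .compare_exchange(RUNNING, COMMITTING, Ordering::SeqCst, Ordering::SeqCst)
            .is_ok()
    }
}
```

Because both transitions start from `RUNNING` and use compare-and-exchange, exactly one of "cancel" and "commit" can win, so a manifest update is never published for a compaction that was already reported as cancelled.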


@@ -1114,13 +1114,9 @@ impl EngineInner {
}
fn role(&self, region_id: RegionId) -> Option<RegionRole> {
self.workers.get_region(region_id).map(|region| {
if region.is_follower() {
RegionRole::Follower
} else {
RegionRole::Leader
}
})
self.workers
.get_region(region_id)
.map(|region| region.region_role())
}
}


@@ -19,10 +19,11 @@ use std::time::Duration;
use api::v1::helper::{row, tag_column_schema};
use api::v1::value::ValueData;
use api::v1::{ColumnDataType, Row, Rows, SemanticType};
use api::v1::{ColumnDataType, Row, Rows, SemanticType, Value};
use common_error::ext::ErrorExt;
use common_meta::ddl::utils::{parse_column_metadatas, parse_manifest_infos_from_extensions};
use common_recordbatch::RecordBatches;
use datafusion_expr::col;
use datatypes::prelude::ConcreteDataType;
use datatypes::schema::{ColumnSchema, FulltextAnalyzer, FulltextBackend, FulltextOptions};
use store_api::metadata::ColumnMetadata;
@@ -41,8 +42,8 @@ use crate::error;
use crate::sst::FormatType;
use crate::test_util::batch_util::sort_batches_and_print;
use crate::test_util::{
CreateRequestBuilder, TestEnv, build_rows, build_rows_for_key, flush_region, put_rows,
rows_schema,
CreateRequestBuilder, TestEnv, build_rows, build_rows_for_key,
column_metadata_to_column_schema, flush_region, put_rows, rows_schema,
};
async fn scan_check_after_alter(engine: &MitoEngine, region_id: RegionId, expected: &str) {
@@ -102,6 +103,54 @@ fn alter_column_fulltext_options() -> RegionAlterRequest {
}
}
fn add_nullable_field1() -> RegionAlterRequest {
RegionAlterRequest {
kind: AlterKind::AddColumns {
columns: vec![AddColumn {
column_metadata: ColumnMetadata {
column_schema: ColumnSchema::new(
"field_1",
ConcreteDataType::float64_datatype(),
true,
),
semantic_type: SemanticType::Field,
column_id: 3,
},
location: None,
}],
},
}
}
fn build_row_with_added_field(
metadata: &[ColumnMetadata],
tag_0: &str,
field_0: f64,
field_1: Option<f64>,
ts_millis: i64,
) -> Row {
let values = metadata
.iter()
.map(|column| match column.column_schema.name.as_str() {
"tag_0" => Value {
value_data: Some(ValueData::StringValue(tag_0.to_string())),
},
"field_0" => Value {
value_data: Some(ValueData::F64Value(field_0)),
},
"field_1" => Value {
value_data: field_1.map(ValueData::F64Value),
},
"ts" => Value {
value_data: Some(ValueData::TimestampMillisecondValue(ts_millis)),
},
name => panic!("unexpected column {name}"),
})
.collect();
Row { values }
}
fn check_region_version(
engine: &MitoEngine,
region_id: RegionId,
@@ -236,6 +285,105 @@ async fn test_alter_region_with_format(flat_format: bool) {
check_region_version(&engine, region_id, 1, 3, 1, 3);
}
#[tokio::test]
async fn test_filter_is_null_after_alter_add_field() {
test_filter_is_null_after_alter_add_field_with_format(false).await;
test_filter_is_null_after_alter_add_field_with_format(true).await;
}
async fn test_filter_is_null_after_alter_add_field_with_format(flat_format: bool) {
common_telemetry::init_default_ut_logging();
let mut env = TestEnv::new().await;
let engine = env
.create_engine(MitoConfig {
default_flat_format: flat_format,
..Default::default()
})
.await;
let region_id = RegionId::new(1, 1);
let request = CreateRequestBuilder::new().build();
env.get_schema_metadata_manager()
.register_region_table_info(
region_id.table_id(),
"test_table",
"test_catalog",
"test_schema",
None,
env.get_kv_backend(),
)
.await;
let column_schemas = rows_schema(&request);
engine
.handle_request(region_id, RegionRequest::Create(request))
.await
.unwrap();
put_rows(
&engine,
region_id,
Rows {
schema: column_schemas,
rows: vec![build_rows_for_key("a", 0, 1, 1).into_iter().next().unwrap()],
},
)
.await;
flush_region(&engine, region_id, None).await;
engine
.handle_request(region_id, RegionRequest::Alter(add_nullable_field1()))
.await
.unwrap();
let region = engine.get_region(region_id).unwrap();
let metadata = region.metadata().column_metadatas.clone();
let schema = metadata
.iter()
.map(column_metadata_to_column_schema)
.collect();
put_rows(
&engine,
region_id,
Rows {
schema,
rows: vec![build_row_with_added_field(
&metadata,
"a",
1.0,
Some(10.0),
0,
)],
},
)
.await;
flush_region(&engine, region_id, None).await;
// We skip field filters under merge mode because the flushed field values may be stale before
// the row is merged with newer field data.
let stream = engine
.scan_to_stream(
region_id,
ScanRequest {
filters: vec![col("field_1").is_null()],
..Default::default()
},
)
.await
.unwrap();
let batches = RecordBatches::try_collect(stream).await.unwrap();
let expected = "\
+-------+---------+---------------------+---------+
| tag_0 | field_0 | ts | field_1 |
+-------+---------+---------------------+---------+
| a | 1.0 | 1970-01-01T00:00:00 | 10.0 |
+-------+---------+---------------------+---------+";
assert_eq!(expected, batches.pretty_print().unwrap());
}
/// Build rows with schema (string, f64, ts_millis, string).
fn build_rows_for_tags(
tag0: &str,


@@ -333,7 +333,7 @@ async fn test_apply_staging_manifest_success_with_format(flat_format: bool) {
let staging_manifest = region.manifest_ctx.staging_manifest().await;
assert!(staging_manifest.is_none());
// The staging partition expr should be cleared.
assert!(region.staging_partition_info.lock().unwrap().is_none());
assert!(region.manifest_ctx.staging_partition_info().is_none());
// The staging manifest directory should be empty.
let data_home = env.data_home();
let region_dir = format!("{}/data/test/1_0000000001", data_home.display());


@@ -18,6 +18,8 @@ use std::sync::Arc;
use std::time::Duration;
use api::v1::{ColumnSchema, Rows};
use common_error::ext::ErrorExt;
use common_error::status_code::StatusCode;
use common_recordbatch::{RecordBatches, SendableRecordBatchStream};
use datatypes::arrow::array::AsArray;
use datatypes::arrow::datatypes::TimestampMillisecondType;
@@ -650,7 +652,7 @@ async fn test_readonly_during_compaction_with_format(flat_format: bool) {
}
#[tokio::test]
async fn test_enter_staging_deferred_by_inflight_compaction() {
async fn test_enter_staging_cancels_inflight_local_compaction_before_commit() {
common_telemetry::init_default_ut_logging();
let mut env = TestEnv::new().await;
let listener = Arc::new(CompactionListener::default());
@@ -706,17 +708,91 @@ async fn test_enter_staging_deferred_by_inflight_compaction() {
}),
)
.await
.unwrap();
});
tokio::time::sleep(Duration::from_millis(100)).await;
assert!(!enter_staging.is_finished());
// The enter staging request should finish, and the compaction should be cancelled.
assert!(enter_staging.is_finished());
let _ = enter_staging.await.unwrap().unwrap();
}
listener.wake();
enter_staging.await.unwrap();
#[tokio::test]
async fn test_manual_compaction_returns_cancelled_when_enter_staging_cancels_it() {
common_telemetry::init_default_ut_logging();
let mut env = TestEnv::new().await;
let listener = Arc::new(CompactionListener::default());
let engine = env
.create_engine_with(
MitoConfig {
max_background_purges: 1,
..Default::default()
},
None,
Some(listener.clone()),
None,
)
.await;
let region = engine.get_region(region_id).unwrap();
assert!(region.is_staging());
let region_id = RegionId::new(2050, 1);
env.get_schema_metadata_manager()
.register_region_table_info(
region_id.table_id(),
"test_table",
"test_catalog",
"test_schema",
None,
env.get_kv_backend(),
)
.await;
let request = CreateRequestBuilder::new()
.insert_option("compaction.type", "twcs")
.build();
let column_schemas = request
.column_metadatas
.iter()
.map(column_metadata_to_column_schema)
.collect::<Vec<_>>();
engine
.handle_request(region_id, RegionRequest::Create(request))
.await
.unwrap();
put_and_flush(&engine, region_id, &column_schemas, 0..10).await;
put_and_flush(&engine, region_id, &column_schemas, 5..20).await;
let engine_cloned = engine.clone();
let compact = tokio::spawn(async move {
engine_cloned
.handle_request(
region_id,
RegionRequest::Compact(RegionCompactRequest::default()),
)
.await
});
listener.wait_handle_finished().await;
let engine_cloned = engine.clone();
let enter_staging = tokio::spawn(async move {
engine_cloned
.handle_request(
region_id,
RegionRequest::EnterStaging(EnterStagingRequest {
partition_directive: StagingPartitionDirective::RejectAllWrites,
}),
)
.await
});
tokio::time::sleep(Duration::from_millis(100)).await;
assert!(compact.is_finished());
assert!(enter_staging.is_finished());
let err = compact.await.unwrap().unwrap_err();
assert_eq!(err.status_code(), StatusCode::Cancelled);
let _ = enter_staging.await.unwrap();
}
#[tokio::test]


@@ -12,8 +12,10 @@
// See the License for the specific language governing permissions and
// limitations under the License.
use api::v1::Rows;
use api::v1::value::ValueData;
use api::v1::{Row, Rows, Value};
use common_recordbatch::RecordBatches;
use datafusion_expr::{col, lit};
use store_api::region_engine::RegionEngine;
use store_api::region_request::RegionRequest;
use store_api::storage::{RegionId, ScanRequest};
@@ -24,6 +26,22 @@ use crate::test_util::{
CreateRequestBuilder, TestEnv, build_rows, delete_rows, flush_region, put_rows, rows_schema,
};
fn build_row_with_nullable_field(key: &str, field_0: Option<f64>, ts_millis: i64) -> Row {
Row {
values: vec![
Value {
value_data: Some(ValueData::StringValue(key.to_string())),
},
Value {
value_data: field_0.map(ValueData::F64Value),
},
Value {
value_data: Some(ValueData::TimestampMillisecondValue(ts_millis)),
},
],
}
}
#[tokio::test]
async fn test_scan_without_filtering_deleted() {
test_scan_without_filtering_deleted_with_format(false).await;
@@ -121,3 +139,84 @@ async fn test_scan_without_filtering_deleted_with_format(flat_format: bool) {
+-------+---------+---------------------+";
assert_eq!(expected, batches.pretty_print().unwrap());
}
#[tokio::test]
async fn test_filter_field_value_after_last_row_update() {
test_filter_field_value_after_last_row_update_with_format(false).await;
test_filter_field_value_after_last_row_update_with_format(true).await;
}
async fn test_filter_field_value_after_last_row_update_with_format(flat_format: bool) {
common_telemetry::init_default_ut_logging();
let mut env = TestEnv::new().await;
let engine = env
.create_engine(MitoConfig {
default_flat_format: flat_format,
..Default::default()
})
.await;
let region_id = RegionId::new(1, 1);
env.get_schema_metadata_manager()
.register_region_table_info(
region_id.table_id(),
"test_table",
"test_catalog",
"test_schema",
None,
env.get_kv_backend(),
)
.await;
let request = CreateRequestBuilder::new().build();
let column_schemas = rows_schema(&request);
engine
.handle_request(region_id, RegionRequest::Create(request))
.await
.unwrap();
put_rows(
&engine,
region_id,
Rows {
schema: column_schemas.clone(),
rows: vec![build_row_with_nullable_field("a", Some(10.0), 0)],
},
)
.await;
flush_region(&engine, region_id, None).await;
put_rows(
&engine,
region_id,
Rows {
schema: column_schemas,
rows: vec![build_row_with_nullable_field("a", Some(20.0), 0)],
},
)
.await;
flush_region(&engine, region_id, None).await;
// We skip field filters under merge mode because the flushed field values may be stale before
// the last-row update is merged.
let stream = engine
.scan_to_stream(
region_id,
ScanRequest {
filters: vec![col("field_0").eq(lit(10.0))],
..Default::default()
},
)
.await
.unwrap();
let batches = RecordBatches::try_collect(stream).await.unwrap();
let expected = "\
+-------+---------+---------------------+
| tag_0 | field_0 | ts |
+-------+---------+---------------------+
| a | 20.0 | 1970-01-01T00:00:00 |
+-------+---------+---------------------+";
assert_eq!(expected, batches.pretty_print().unwrap());
}
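The test above relies on the reader not applying field filters before rows from different flushed files are merged. A minimal std-only sketch of why the order matters follows. `FlushedRow` and the helper functions are hypothetical and model only last-row deduplication by sequence number, not the mito reader itself.

```rust
use std::cmp::Reverse;

/// Hypothetical flushed row: key, timestamp, write sequence, and one field.
#[derive(Clone, Debug, PartialEq)]
struct FlushedRow {
    key: &'static str,
    ts: i64,
    seq: u64,
    field: f64,
}

/// Keeps only the row with the highest sequence for each `(key, ts)`.
fn dedup_last_row(mut rows: Vec<FlushedRow>) -> Vec<FlushedRow> {
    // Sort so the newest version of each `(key, ts)` comes first.
    rows.sort_by(|a, b| (a.key, a.ts, Reverse(a.seq)).cmp(&(b.key, b.ts, Reverse(b.seq))));
    // `dedup_by` keeps the first row of each run, which is the newest one.
    rows.dedup_by(|next, kept| next.key == kept.key && next.ts == kept.ts);
    rows
}

/// Unsafe order: filtering per file first can resurrect an overwritten value.
fn filter_then_dedup(rows: Vec<FlushedRow>, wanted: f64) -> Vec<FlushedRow> {
    dedup_last_row(rows.into_iter().filter(|r| r.field == wanted).collect())
}

/// Safe order: merge versions first, then filter the surviving rows.
fn dedup_then_filter(rows: Vec<FlushedRow>, wanted: f64) -> Vec<FlushedRow> {
    dedup_last_row(rows)
        .into_iter()
        .filter(|r| r.field == wanted)
        .collect()
}
```

With one row written as `10.0` and then overwritten as `20.0`, filtering on `10.0` before deduplication returns the stale first version, while deduplicating first correctly returns nothing.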


@@ -138,7 +138,6 @@ async fn test_prune_tag_and_field() {
async fn test_prune_tag_and_field_with_format(flat_format: bool) {
common_telemetry::init_default_ut_logging();
// prune result: only row group 1
check_prune_row_groups(
vec![
col("tag_0").gt(lit(ScalarValue::Utf8(Some("4".to_string())))),
@@ -443,7 +442,10 @@ async fn test_scan_filter_field_after_delete_with_format(flat_format: bool) {
)
.await;
// Scans and filter fields, the field should be deleted.
// Scans and filters by a field value. The mito reader skips field filters under
// `PreFilterMode::SkipFields` (DataFusion re-applies them above the engine), so
// the returned batches still contain all non-deleted rows — the reader's job here
// is only to ensure the delete op is honored.
let request = ScanRequest {
filters: vec![col("field_0").eq(lit(3.0f64))],
..Default::default()
@@ -454,10 +456,12 @@ async fn test_scan_filter_field_after_delete_with_format(flat_format: bool) {
.unwrap();
let batches = RecordBatches::try_collect(stream).await.unwrap();
let expected = "\
+-------+---------+----+
| tag_0 | field_0 | ts |
+-------+---------+----+
+-------+---------+----+";
+-------+---------+---------------------+
| tag_0 | field_0 | ts |
+-------+---------+---------------------+
| 1 | 1.0 | 1970-01-01T00:00:01 |
| 4 | 4.0 | 1970-01-01T00:00:04 |
+-------+---------+---------------------+";
assert_eq!(
expected,
sort_batches_and_print(&batches, &["tag_0", "field_0", "ts"])


@@ -19,7 +19,9 @@ use store_api::region_engine::{
RegionEngine, RegionRole, SetRegionRoleStateResponse, SetRegionRoleStateSuccess,
SettableRegionRoleState,
};
use store_api::region_request::{RegionPutRequest, RegionRequest};
use store_api::region_request::{
EnterStagingRequest, RegionPutRequest, RegionRequest, StagingPartitionDirective,
};
use store_api::storage::RegionId;
use crate::config::MitoConfig;
@@ -241,12 +243,14 @@ async fn test_unified_state_transitions_with_format(flat_format: bool) {
.await
.unwrap();
assert_success_response(&result, 0);
assert_eq!(engine.role(region_id), Some(RegionRole::StagingLeader));
let result = engine
.set_region_role_state_gracefully(region_id, SettableRegionRoleState::Leader)
.await
.unwrap();
assert_success_response(&result, 0);
assert_eq!(engine.role(region_id), Some(RegionRole::Leader));
// Leader -> StagingLeader -> Follower (exit staging via demotion)
engine
@@ -259,6 +263,7 @@ async fn test_unified_state_transitions_with_format(flat_format: bool) {
.await
.unwrap();
assert_success_response(&result, 0);
assert_eq!(engine.role(region_id), Some(RegionRole::Follower));
// Note: Direct Follower -> Leader promotion is no longer allowed
// Use existing set_region_role method for follower -> leader promotion
@@ -277,6 +282,7 @@ async fn test_unified_state_transitions_with_format(flat_format: bool) {
.await
.unwrap();
assert_success_response(&result, 0);
assert_eq!(engine.role(region_id), Some(RegionRole::DowngradingLeader));
// Note: Direct DowngradingLeader -> Leader is no longer allowed
// Use existing set_region_role method for downgrading -> leader promotion
@@ -325,6 +331,264 @@ async fn test_restricted_state_transitions() {
test_restricted_state_transitions_with_format(true).await;
}
#[tokio::test]
async fn test_direct_set_region_role_staging_leader_is_noop() {
let mut env = TestEnv::new().await;
let engine = env.create_engine(MitoConfig::default()).await;
let region_id = RegionId::new(1, 1);
let request = CreateRequestBuilder::new().build();
engine
.handle_request(region_id, RegionRequest::Create(request))
.await
.unwrap();
engine
.set_region_role(region_id, RegionRole::StagingLeader)
.unwrap();
assert_eq!(engine.role(region_id), Some(RegionRole::Leader));
engine
.set_region_role(region_id, RegionRole::Follower)
.unwrap();
engine
.set_region_role(region_id, RegionRole::StagingLeader)
.unwrap();
assert_eq!(engine.role(region_id), Some(RegionRole::Follower));
}
#[tokio::test]
async fn test_direct_set_region_role_exits_staging_state_only() {
let mut env = TestEnv::new().await;
let engine = env.create_engine(MitoConfig::default()).await;
let region_id = RegionId::new(1, 1);
let request = CreateRequestBuilder::new().build();
engine
.handle_request(region_id, RegionRequest::Create(request))
.await
.unwrap();
engine
.handle_request(
region_id,
RegionRequest::EnterStaging(EnterStagingRequest {
partition_directive: StagingPartitionDirective::RejectAllWrites,
}),
)
.await
.unwrap();
assert_eq!(engine.role(region_id), Some(RegionRole::StagingLeader));
assert!(
engine
.get_region(region_id)
.unwrap()
.manifest_ctx
.staging_partition_info()
.is_some()
);
engine
.set_region_role(region_id, RegionRole::Leader)
.unwrap();
assert_eq!(engine.role(region_id), Some(RegionRole::Leader));
assert!(
engine
.get_region(region_id)
.unwrap()
.manifest_ctx
.staging_partition_info()
.is_none()
);
}
#[tokio::test]
async fn test_set_region_role_can_exit_staging_to_leader() {
let mut env = TestEnv::new().await;
let engine = env.create_engine(MitoConfig::default()).await;
let region_id = RegionId::new(1, 1);
let request = CreateRequestBuilder::new().build();
engine
.handle_request(region_id, RegionRequest::Create(request))
.await
.unwrap();
engine
.set_region_role_state_gracefully(region_id, SettableRegionRoleState::StagingLeader)
.await
.unwrap();
assert_eq!(engine.role(region_id), Some(RegionRole::StagingLeader));
engine
.set_region_role(region_id, RegionRole::Leader)
.unwrap();
assert_eq!(engine.role(region_id), Some(RegionRole::Leader));
assert!(
engine
.get_region(region_id)
.unwrap()
.manifest_ctx
.staging_partition_info()
.is_none()
);
}
#[tokio::test]
async fn test_set_region_role_leader_clears_staging_partition_info() {
let mut env = TestEnv::new().await;
let engine = env.create_engine(MitoConfig::default()).await;
let region_id = RegionId::new(1, 1);
let request = CreateRequestBuilder::new().build();
engine
.handle_request(region_id, RegionRequest::Create(request))
.await
.unwrap();
engine
.handle_request(
region_id,
RegionRequest::EnterStaging(EnterStagingRequest {
partition_directive: StagingPartitionDirective::RejectAllWrites,
}),
)
.await
.unwrap();
let region = engine.get_region(region_id).unwrap();
assert!(region.manifest_ctx.staging_partition_info().is_some());
engine
.set_region_role(region_id, RegionRole::Leader)
.unwrap();
let region = engine.get_region(region_id).unwrap();
assert_eq!(engine.role(region_id), Some(RegionRole::Leader));
assert!(region.manifest_ctx.staging_partition_info().is_none());
}
#[tokio::test]
async fn test_set_region_role_follower_clears_staging_partition_info() {
let mut env = TestEnv::new().await;
let engine = env.create_engine(MitoConfig::default()).await;
let region_id = RegionId::new(1, 1);
let request = CreateRequestBuilder::new().build();
engine
.handle_request(region_id, RegionRequest::Create(request))
.await
.unwrap();
engine
.handle_request(
region_id,
RegionRequest::EnterStaging(EnterStagingRequest {
partition_directive: StagingPartitionDirective::RejectAllWrites,
}),
)
.await
.unwrap();
let region = engine.get_region(region_id).unwrap();
assert!(region.manifest_ctx.staging_partition_info().is_some());
engine
.set_region_role(region_id, RegionRole::Follower)
.unwrap();
let region = engine.get_region(region_id).unwrap();
assert_eq!(engine.role(region_id), Some(RegionRole::Follower));
assert!(region.manifest_ctx.staging_partition_info().is_none());
}
#[tokio::test]
async fn test_set_region_role_downgrading_leader_clears_staging_partition_info() {
let mut env = TestEnv::new().await;
let engine = env.create_engine(MitoConfig::default()).await;
let region_id = RegionId::new(1, 1);
let request = CreateRequestBuilder::new().build();
engine
.handle_request(region_id, RegionRequest::Create(request))
.await
.unwrap();
engine
.handle_request(
region_id,
RegionRequest::EnterStaging(EnterStagingRequest {
partition_directive: StagingPartitionDirective::RejectAllWrites,
}),
)
.await
.unwrap();
let region = engine.get_region(region_id).unwrap();
assert!(region.manifest_ctx.staging_partition_info().is_some());
engine
.set_region_role(region_id, RegionRole::DowngradingLeader)
.unwrap();
let region = engine.get_region(region_id).unwrap();
assert_eq!(engine.role(region_id), Some(RegionRole::DowngradingLeader));
assert!(region.manifest_ctx.staging_partition_info().is_none());
}
#[tokio::test]
async fn test_can_reenter_staging_after_direct_exit_cleanup() {
let mut env = TestEnv::new().await;
let engine = env.create_engine(MitoConfig::default()).await;
let region_id = RegionId::new(1, 1);
let request = CreateRequestBuilder::new().build();
engine
.handle_request(region_id, RegionRequest::Create(request))
.await
.unwrap();
engine
.handle_request(
region_id,
RegionRequest::EnterStaging(EnterStagingRequest {
partition_directive: StagingPartitionDirective::RejectAllWrites,
}),
)
.await
.unwrap();
engine
.set_region_role(region_id, RegionRole::Follower)
.unwrap();
engine
.set_region_role(region_id, RegionRole::Leader)
.unwrap();
engine
.handle_request(
region_id,
RegionRequest::EnterStaging(EnterStagingRequest {
partition_directive: StagingPartitionDirective::RejectAllWrites,
}),
)
.await
.unwrap();
let region = engine.get_region(region_id).unwrap();
assert_eq!(engine.role(region_id), Some(RegionRole::StagingLeader));
assert!(region.manifest_ctx.staging_partition_info().is_some());
}
async fn test_restricted_state_transitions_with_format(flat_format: bool) {
let mut env = TestEnv::new().await;
let engine = env

@@ -547,7 +547,7 @@ async fn test_staging_manifest_directory_with_format(flat_format: bool) {
.await
.unwrap();
let region = engine.get_region(region_id).unwrap();
let staging_partition_info = region.staging_partition_info.lock().unwrap().clone();
let staging_partition_info = region.manifest_ctx.staging_partition_info();
assert_eq!(
staging_partition_info
.unwrap()

@@ -1073,6 +1073,9 @@ pub enum Error {
#[snafu(display("Manual compaction is override by following operations."))]
ManualCompactionOverride {},
#[snafu(display("Compaction is cancelled."))]
CompactionCancelled {},
#[snafu(display("Compaction memory exhausted for region {region_id} (policy: {policy})",))]
CompactionMemoryExhausted {
region_id: RegionId,
@@ -1389,7 +1392,7 @@ impl ErrorExt for Error {
#[cfg(feature = "vector_index")]
VectorIndexBuild { .. } | VectorIndexFinish { .. } => StatusCode::Internal,
ManualCompactionOverride {} => StatusCode::Cancelled,
ManualCompactionOverride {} | CompactionCancelled {} => StatusCode::Cancelled,
CompactionMemoryExhausted { source, .. } => source.status_code(),

@@ -16,7 +16,7 @@ use std::fmt::Debug;
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use common_telemetry::{error, info};
use common_telemetry::{error, info, warn};
use store_api::storage::RegionId;
use store_api::{MIN_VERSION, ManifestVersion};
@@ -69,14 +69,18 @@ impl Inner {
return;
}
if let Err(e) = self.manifest_store.delete_until(version, true).await {
error!(e; "Failed to delete manifest actions until version {} for region {}", version, region_id);
return;
}
// Advance the in-memory checkpoint version as soon as the checkpoint file
// is durable. If the subsequent delta cleanup fails, the on-disk state is
// still consistent (the `_last_checkpoint` metadata points at the new
// checkpoint) and `maybe_do_checkpoint` must not re-checkpoint the same
// range.
self.last_checkpoint_version
.store(version, Ordering::Relaxed);
if let Err(e) = self.manifest_store.delete_until(version, true).await {
warn!(e; "Failed to delete manifest actions until version {} for region {}, leftover files will be ignored on recovery", version, region_id);
}
info!(
"Checkpoint for region {} success, version: {}",
region_id, version
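
Note on the hunk above: the reordering moves the in-memory version store ahead of the delta cleanup and downgrades a cleanup failure from an early return to a warning. A minimal sketch of that ordering follows. `CheckpointState`, `finish_checkpoint`, and the `cleanup` closure are hypothetical names for illustration, not the real `Checkpointer` internals.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical, simplified stand-in for the checkpointer's shared state.
struct CheckpointState {
    last_checkpoint_version: AtomicU64,
}

impl CheckpointState {
    /// Called once the checkpoint file for `version` is durable.
    ///
    /// The marker is advanced first; `cleanup` is best effort. Returns
    /// whether cleanup succeeded, but the marker advances either way.
    fn finish_checkpoint<F>(&self, version: u64, cleanup: F) -> bool
    where
        F: FnOnce(u64) -> Result<usize, String>,
    {
        self.last_checkpoint_version
            .store(version, Ordering::Relaxed);
        match cleanup(version) {
            Ok(_) => true,
            Err(e) => {
                // Leftover files are tolerated; log and continue.
                eprintln!("cleanup until version {version} failed: {e}");
                false
            }
        }
    }
}
```

With the previous ordering, a failed cleanup returned before the store, so the marker stayed behind the durable checkpoint.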

@@ -14,6 +14,7 @@
use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;
use common_datasource::compression::CompressionType;
use common_telemetry::{debug, info};
@@ -34,8 +35,8 @@ use crate::manifest::action::{
};
use crate::manifest::checkpointer::Checkpointer;
use crate::manifest::storage::{
ManifestObjectStore, file_version, is_checkpoint_file, is_delta_file, manifest_compress_type,
manifest_dir,
ManifestObjectStore, file_version, is_checkpoint_file, is_delta_file, list_start_after,
manifest_compress_type, manifest_dir,
};
use crate::metrics::MANIFEST_OP_ELAPSED;
use crate::region::{ManifestStats, RegionLeaderState, RegionRoleState};
@@ -255,6 +256,7 @@ impl RegionManifestManager {
let _t = MANIFEST_OP_ELAPSED
.with_label_values(&["open"])
.start_timer();
let open_start = Instant::now();
// construct storage
let mut store = ManifestObjectStore::new(
@@ -290,8 +292,15 @@ impl RegionManifestManager {
RegionManifestBuilder::default()
};
let replay_start_version = version;
info!(
"Replaying region manifest {} from version {}, last checkpoint version: {}",
options.manifest_dir, replay_start_version, last_checkpoint_version,
);
// apply actions from storage
let manifests = store.fetch_manifests(version, MAX_VERSION).await?;
let replayed_deltas = manifests.len();
for (manifest_version, raw_action_list) in manifests {
let action_list = RegionMetaActionList::decode(&raw_action_list)?;
@@ -334,6 +343,7 @@ impl RegionManifestManager {
);
let version = manifest.manifest_version;
let manifest_dir = options.manifest_dir.clone();
let checkpointer = Checkpointer::new(
manifest.metadata.region_id,
options,
@@ -344,6 +354,16 @@ impl RegionManifestManager {
manifest
.removed_files
.update_file_removed_cnt_to_stats(stats);
info!(
"Opened region manifest {}, region_id: {}, start_version: {}, last_checkpoint_version: {}, replayed_deltas: {}, final_version: {}, cost: {:?}",
manifest_dir,
manifest.metadata.region_id,
replay_start_version,
last_checkpoint_version,
replayed_deltas,
version,
open_start.elapsed(),
);
Ok(Some(Self {
store,
last_version: manifest_version,
@@ -632,13 +652,17 @@ impl RegionManifestManager {
pub async fn has_update(&self) -> Result<bool> {
let last_version = self.last_version();
let streamer =
self.store
.manifest_lister(false)
.await?
.context(error::EmptyManifestDirSnafu {
manifest_dir: self.store.manifest_dir(),
})?;
// Skip older files at the object-store layer. Files for `v == last_version`
// may still appear (`{path}{v:020}` sorts before `{path}{v:020}.json`) but
// they are filtered out below by the `version > last_version` check.
let start_after = list_start_after(self.store.manifest_dir(), last_version);
let streamer = self
.store
.manifest_lister(false, Some(&start_after))
.await?
.context(error::EmptyManifestDirSnafu {
manifest_dir: self.store.manifest_dir(),
})?;
let need_update = streamer
.try_any(|entry| async move {

@@ -24,18 +24,22 @@ use std::sync::Arc;
use std::sync::atomic::AtomicU64;
use common_datasource::compression::CompressionType;
use common_telemetry::debug;
use common_telemetry::{debug, warn};
use crc32fast::Hasher;
use lazy_static::lazy_static;
use object_store::util::join_dir;
use object_store::{Lister, ObjectStore, util};
use regex::Regex;
use snafu::{ResultExt, ensure};
#[cfg(test)]
use snafu::ResultExt;
use snafu::ensure;
use store_api::ManifestVersion;
use store_api::storage::RegionId;
use crate::cache::manifest_cache::ManifestCache;
use crate::error::{ChecksumMismatchSnafu, OpenDalSnafu, Result};
#[cfg(test)]
use crate::error::OpenDalSnafu;
use crate::error::{ChecksumMismatchSnafu, Result};
use crate::manifest::storage::checkpoint::CheckpointStorage;
use crate::manifest::storage::delta::DeltaStorage;
use crate::manifest::storage::size_tracker::{CheckpointTracker, DeltaTracker, SizeTracker};
@@ -76,6 +80,24 @@ pub fn checkpoint_file(version: ManifestVersion) -> String {
format!("{version:020}.checkpoint")
}
/// Returns a lexicographic `start_after` key for an object-store `list`
/// request over the manifest directory at `path`.
///
/// `path` must be the same directory prefix passed to `lister_with(path)`
/// and must end with `/`. OpenDAL resolves `start_after` against the
/// operator root, not relative to the listed path, so the caller must
/// supply the full prefix — otherwise the bound is compared against keys
/// that already share a longer prefix and is silently a no-op.
pub(crate) fn list_start_after(path: &str, version: ManifestVersion) -> String {
debug_assert!(
path.ends_with('/'),
"list_start_after: path must end with '/', got {path:?}",
);
// Manifest files are named `{version:020}.{json,checkpoint}[.gz]` and sort lexicographically;
// `{path}{version:020}` is a strict prefix of `{path}{version:020}.{json,checkpoint}[.gz]`.
format!("{path}{version:020}")
}
pub fn gen_path(path: &str, file: &str, compress_type: CompressionType) -> String {
if compress_type == CompressionType::Uncompressed {
format!("{}{}", path, file)
@@ -194,11 +216,19 @@ impl ManifestObjectStore {
}
/// Returns an iterator of manifests from normal or staging directory.
pub(crate) async fn manifest_lister(&self, is_staging: bool) -> Result<Option<Lister>> {
///
/// `start_after` is forwarded to the non-staging lister to skip entries
/// whose name is lexicographically less than or equal to it. It is
/// ignored for the staging directory.
pub(crate) async fn manifest_lister(
&self,
is_staging: bool,
start_after: Option<&str>,
) -> Result<Option<Lister>> {
if is_staging {
self.staging_storage.manifest_lister().await
} else {
self.delta_storage.manifest_lister().await
self.delta_storage.manifest_lister(start_after).await
}
}
@@ -239,9 +269,14 @@ impl ManifestObjectStore {
keep_last_checkpoint: bool,
) -> Result<usize> {
// Stores (entry, is_checkpoint, version) in a Vec.
//
// `start_after` is intentionally `None` here: a previous deletion
// may have been interrupted and left stale files at versions below
// the current checkpoint; we need the lister to surface them so
// cleanup can finish.
let entries: Vec<_> = self
.delta_storage
.get_paths(|entry| {
.get_paths(None, |entry| {
let file_name = entry.name();
let is_checkpoint = is_checkpoint_file(file_name);
if is_delta_file(file_name) || is_checkpoint_file(file_name) {
@@ -287,11 +322,11 @@ impl ManifestObjectStore {
.iter()
.map(|(e, _, _)| e.path().to_string())
.collect::<Vec<_>>();
let ret = paths.len();
let total = paths.len();
debug!(
"Deleting {} logs from manifest storage path {} until {}, checkpoint_version: {:?}, paths: {:?}",
ret, self.path, end, checkpoint_version, paths,
total, self.path, end, checkpoint_version, paths,
);
// Remove from cache first
@@ -299,13 +334,37 @@ impl ManifestObjectStore {
remove_from_cache(self.manifest_cache.as_ref(), entry.path()).await;
}
self.object_store
.delete_iter(paths)
.await
.context(OpenDalSnafu)?;
// Try batch delete first. On failure, fall back to per-file deletes.
// This is a workaround for S3-compatible object stores that do not support batch delete. See issue #7986.
let mut succeeded = vec![false; del_entries.len()];
match self.object_store.delete_iter(paths.clone()).await {
Ok(()) => succeeded.fill(true),
Err(batch_err) => {
warn!(
batch_err;
"Batch delete failed for manifest path {}, falling back to per-file delete for {} paths",
self.path, total,
);
for (i, path) in paths.iter().enumerate() {
if let Err(e) = self.object_store.delete(path).await {
warn!(
e;
"Failed to delete manifest file {} under {}, aborting fallback, {} files will be retried on next checkpoint",
path, self.path, total - i,
);
break;
}
succeeded[i] = true;
}
}
}
// delete manifest sizes
for (_, is_checkpoint, version) in &del_entries {
let mut deleted = 0usize;
for (i, (_, is_checkpoint, version)) in del_entries.iter().enumerate() {
if !succeeded[i] {
continue;
}
deleted += 1;
if *is_checkpoint {
self.size_tracker
.remove(&size_tracker::FileKey::Checkpoint(*version));
@@ -315,7 +374,7 @@ impl ManifestObjectStore {
}
}
Ok(ret)
Ok(deleted)
}
/// Save the delta manifest file.
@@ -420,12 +479,16 @@ mod tests {
use crate::manifest::storage::checkpoint::CheckpointMetadata;
fn new_test_manifest_store() -> ManifestObjectStore {
new_test_manifest_store_at("/")
}
fn new_test_manifest_store_at(path: &str) -> ManifestObjectStore {
common_telemetry::init_default_ut_logging();
let tmp_dir = create_temp_dir("test_manifest_log_store");
let builder = Fs::default().root(&tmp_dir.path().to_string_lossy());
let object_store = ObjectStore::new(builder).unwrap().finish();
ManifestObjectStore::new(
"/",
path,
object_store,
CompressionType::Uncompressed,
Default::default(),
@@ -690,4 +753,66 @@ mod tests {
assert_eq!(log_store.total_manifest_size(), 0);
}
#[tokio::test]
async fn test_scan_with_start_after_uncompress() {
let mut log_store = new_test_manifest_store();
log_store.set_compress_type(CompressionType::Uncompressed);
test_scan_with_start_after_case(log_store).await;
}
#[tokio::test]
async fn test_scan_with_start_after_compress() {
let mut log_store = new_test_manifest_store();
log_store.set_compress_type(CompressionType::Gzip);
test_scan_with_start_after_case(log_store).await;
}
// OpenDAL resolves `start_after` against the operator
// root, so the bound must embed the manifest directory prefix. Running the
// same assertions against a non-root path exercises that composition.
#[tokio::test]
async fn test_scan_with_start_after_nested_path() {
let mut log_store = new_test_manifest_store_at("/nested/region-1/");
log_store.set_compress_type(CompressionType::Uncompressed);
test_scan_with_start_after_case(log_store).await;
}
async fn test_scan_with_start_after_case(mut log_store: ManifestObjectStore) {
for v in 0..10 {
log_store
.save(v, format!("hello, {v}").as_bytes(), false)
.await
.unwrap();
}
// A checkpoint at version 5 shares the directory; scan must still
// return only delta files in range.
log_store
.save_checkpoint(5, "checkpoint".as_bytes())
.await
.unwrap();
// start > 0: `start_after` must skip pre-start deltas without losing any.
let entries = log_store.delta_storage.scan(3, 10).await.unwrap();
let versions: Vec<_> = entries.iter().map(|(v, _)| *v).collect();
assert_eq!(versions, vec![3, 4, 5, 6, 7, 8, 9]);
// start == 0: `start_after` is skipped; every delta is returned.
let entries = log_store.delta_storage.scan(0, 10).await.unwrap();
let versions: Vec<_> = entries.iter().map(|(v, _)| *v).collect();
assert_eq!(versions, (0..10).collect::<Vec<_>>());
// Upper bound exclusive.
let entries = log_store.delta_storage.scan(7, 9).await.unwrap();
let versions: Vec<_> = entries.iter().map(|(v, _)| *v).collect();
assert_eq!(versions, vec![7, 8]);
// Start beyond any existing file returns empty.
let entries = log_store
.delta_storage
.scan(10, ManifestVersion::MAX)
.await
.unwrap();
assert!(entries.is_empty());
}
}
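
Note on `list_start_after` in the diff above: the bound works because `{path}{version:020}` is a strict prefix of every file name at that version, so it sorts before them and after everything at lower versions. The sketch below emulates this with plain string comparison. `delta_key` and `list_after` are hypothetical helpers, `ManifestVersion` is assumed to be a `u64`, and the filter only approximates what an object store does with `start_after`.

```rust
/// Stand-in for `store_api::ManifestVersion` (assumed here to be a `u64`).
type ManifestVersion = u64;

/// Same formatting as the `list_start_after` helper in the diff.
fn list_start_after(path: &str, version: ManifestVersion) -> String {
    debug_assert!(path.ends_with('/'));
    format!("{path}{version:020}")
}

/// Hypothetical helper: full key of an uncompressed delta file.
fn delta_key(path: &str, version: ManifestVersion) -> String {
    format!("{path}{version:020}.json")
}

/// Emulates a lister that yields only keys strictly greater than `start_after`.
fn list_after<'a>(keys: &'a [String], start_after: &str) -> Vec<&'a str> {
    keys.iter()
        .map(String::as_str)
        .filter(|k| *k > start_after)
        .collect()
}
```

Passing only the zero-padded version without the directory prefix would compare `"0000…"` against keys that start with the directory name, which is the silent no-op the doc comment warns about.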

@@ -34,7 +34,7 @@ use crate::manifest::storage::utils::{
};
use crate::manifest::storage::{
FETCH_MANIFEST_PARALLELISM, delta_file, file_compress_type, file_version, gen_path,
is_delta_file,
is_delta_file, list_start_after,
};
#[derive(Debug, Clone)]
@@ -76,8 +76,18 @@ impl<T: Tracker> DeltaStorage<T> {
}
/// Returns an iterator of manifests from path directory.
pub(crate) async fn manifest_lister(&self) -> Result<Option<Lister>> {
match self.object_store.lister_with(&self.path).await {
///
/// If `start_after` is `Some`, the lister will skip entries whose name is
/// lexicographically less than or equal to it (see OpenDAL's `start_after`).
pub(crate) async fn manifest_lister(
&self,
start_after: Option<&str>,
) -> Result<Option<Lister>> {
let mut builder = self.object_store.lister_with(&self.path);
if let Some(s) = start_after {
builder = builder.start_after(s);
}
match builder.await {
Ok(streamer) => Ok(Some(streamer)),
Err(e) if e.kind() == ErrorKind::NotFound => {
debug!("Manifest directory does not exist: {}", self.path);
@@ -90,16 +100,22 @@ impl<T: Tracker> DeltaStorage<T> {
/// Return all `R`s in the directory that meet the `filter` conditions (that is, the `filter` closure returns `Some(R)`),
/// and discard `R` that does not meet the conditions (that is, the `filter` closure returns `None`)
/// Return an empty vector when directory is not found.
pub async fn get_paths<F, R>(&self, filter: F) -> Result<Vec<R>>
///
/// `start_after` is forwarded to the underlying lister to skip entries
/// whose name is lexicographically less than or equal to it.
pub async fn get_paths<F, R>(&self, start_after: Option<&str>, mut filter: F) -> Result<Vec<R>>
where
F: Fn(Entry) -> Option<R>,
F: FnMut(Entry) -> Option<R>,
{
let Some(streamer) = self.manifest_lister().await? else {
let Some(streamer) = self.manifest_lister(start_after).await? else {
return Ok(vec![]);
};
streamer
.try_filter_map(|e| async { Ok(filter(e)) })
.try_filter_map(|e| {
let result = filter(e);
async { Ok(result) }
})
.try_collect::<Vec<_>>()
.await
.context(OpenDalSnafu)
@@ -113,8 +129,13 @@ impl<T: Tracker> DeltaStorage<T> {
) -> Result<Vec<(ManifestVersion, Entry)>> {
ensure!(start <= end, InvalidScanIndexSnafu { start, end });
// Push the version lower bound into the list request via
// `list_start_after`; skip the hint when `start == 0` (nothing to skip).
let start_after = (start > 0).then(|| list_start_after(&self.path, start));
let mut total_paths = 0;
let mut entries: Vec<(ManifestVersion, Entry)> = self
.get_paths(|entry| {
.get_paths(start_after.as_deref(), |entry| {
total_paths += 1;
let file_name = entry.name();
if is_delta_file(file_name) {
let version = file_version(file_name);
@@ -128,6 +149,16 @@ impl<T: Tracker> DeltaStorage<T> {
sort_manifests(&mut entries);
common_telemetry::debug!(
"DeltaStorage get paths for {}, start: {}, end: {}, start_after: {:?}, total_paths: {}, entries: {}",
self.path,
start,
end,
start_after,
total_paths,
entries.len()
);
Ok(entries)
}
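
Note on the `Fn` to `FnMut` change above: `scan` now increments `total_paths` inside the filter closure, which a plain `Fn` bound would reject. The following is a synchronous sketch of the same shape using only the standard library. The function names mirror the diff, but the signatures are simplified assumptions (string slices instead of OpenDAL entries, no async stream).

```rust
/// Hypothetical synchronous analogue of `get_paths`: applies `filter` to
/// every entry and keeps the `Some` results.
fn get_paths<F, R>(entries: &[&str], mut filter: F) -> Vec<R>
where
    F: FnMut(&str) -> Option<R>,
{
    entries.iter().filter_map(|e| filter(*e)).collect()
}

/// Parses the version from a delta file name such as `00000000000000000007.json`.
fn delta_version(name: &str) -> Option<u64> {
    name.strip_suffix(".json")?.parse().ok()
}

/// Returns (entries visited, sorted delta versions in `[start, end)`).
fn scan(entries: &[&str], start: u64, end: u64) -> (usize, Vec<u64>) {
    let mut total_paths = 0;
    let mut versions = get_paths(entries, |name| {
        // Mutating captured state is what requires `FnMut`.
        total_paths += 1;
        delta_version(name).filter(|v| (start..end).contains(v))
    });
    versions.sort_unstable();
    (total_paths, versions)
}
```

In the async version the closure is invoked before the `async` block is built, so the future only captures the already computed result rather than a mutable borrow of the filter.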

@@ -156,14 +156,14 @@ impl StagingStorage {
/// Returns an iterator of manifests from staging directory.
pub(crate) async fn manifest_lister(&self) -> Result<Option<Lister>> {
self.delta_storage.manifest_lister().await
self.delta_storage.manifest_lister(None).await
}
/// Fetch all staging manifest files and return them as (version, action_list) pairs.
pub(crate) async fn fetch_manifests(&self) -> Result<Vec<(ManifestVersion, Vec<u8>)>> {
let manifest_entries = self
.delta_storage
.get_paths(|entry| {
.get_paths(None, |entry| {
let file_name = entry.name();
if is_delta_file(file_name) {
let version = file_version(file_name);

@@ -14,9 +14,13 @@
use std::assert_matches;
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::time::Duration;
use common_datasource::compression::CompressionType;
use object_store::layers::mock::{
Error as MockError, ErrorKind, MockLayerBuilder, OpDelete, Result as MockResult, oio,
};
use store_api::storage::{FileId, RegionId};
use strum::IntoEnumIterator;
@@ -26,6 +30,7 @@ use crate::manifest::action::{
};
use crate::manifest::manager::RegionManifestManager;
use crate::manifest::storage::checkpoint::CheckpointMetadata;
use crate::manifest::storage::is_delta_file;
use crate::manifest::tests::utils::basic_region_metadata;
use crate::sst::file::FileMeta;
use crate::test_util::TestEnv;
@@ -118,7 +123,7 @@ async fn manager_without_checkpoint() {
let mut paths = manager
.store()
.delta_storage()
.get_paths(|e| Some(e.name().to_string()))
.get_paths(None, |e| Some(e.name().to_string()))
.await
.unwrap();
paths.sort_unstable();
@@ -161,7 +166,7 @@ async fn manager_with_checkpoint_distance_1() {
let mut paths = manager
.store()
.delta_storage()
.get_paths(|e| Some(e.name().to_string()))
.get_paths(None, |e| Some(e.name().to_string()))
.await
.unwrap();
paths.sort_unstable();
@@ -414,7 +419,7 @@ async fn manifest_install_manifest_to_with_checkpoint() {
let mut paths = manager
.store()
.delta_storage()
.get_paths(|e| Some(e.name().to_string()))
.get_paths(None, |e| Some(e.name().to_string()))
.await
.unwrap();
@@ -489,3 +494,96 @@ async fn test_checkpoint_bypass_in_staging_mode() {
// Checkpoint should include all 16 actions (15 from staging + 1 from writable)
assert_eq!(last_version, 16);
}
/// A deleter that fails on `flush`, simulating the S3 batch-delete failure
/// described in issue #7986.
struct FailingDeleter {
inner: oio::Deleter,
flush_calls: Arc<AtomicUsize>,
}
impl oio::Delete for FailingDeleter {
fn delete(&mut self, path: &str, args: OpDelete) -> MockResult<()> {
self.inner.delete(path, args)
}
async fn flush(&mut self) -> MockResult<usize> {
self.flush_calls.fetch_add(1, Ordering::Relaxed);
Err(MockError::new(
ErrorKind::Unexpected,
"mock manifest delete flush failure",
))
}
}
#[tokio::test]
async fn checkpoint_advances_and_recovery_works_when_delete_fails() {
common_telemetry::init_default_ut_logging();
let flush_calls = Arc::new(AtomicUsize::new(0));
let factory_flush_calls = flush_calls.clone();
let mock_layer = MockLayerBuilder::default()
.deleter_factory(Arc::new(move |inner| {
Box::new(FailingDeleter {
inner,
flush_calls: factory_flush_calls.clone(),
})
}))
.build()
.unwrap();
let env = TestEnv::new().await.with_mock_layer(mock_layer);
let metadata = Arc::new(basic_region_metadata());
let mut manager = env
.create_manifest_manager(CompressionType::Uncompressed, 1, Some(metadata))
.await
.unwrap()
.unwrap();
for _ in 0..10 {
manager.update(nop_action(), false).await.unwrap();
while manager.checkpointer().is_doing_checkpoint() {
tokio::time::sleep(Duration::from_millis(10)).await;
}
}
// The checkpointer must have attempted to delete stale files at least once.
assert!(flush_calls.load(Ordering::Relaxed) > 0);
// Despite delete failures, the in-memory checkpoint marker advances so
// subsequent `maybe_do_checkpoint` calls compute correct ranges.
assert_eq!(manager.checkpointer().last_checkpoint_version(), 10);
// And the durable `_last_checkpoint` metadata reflects the latest version.
let (last_version, _) = manager
.store()
.load_last_checkpoint()
.await
.unwrap()
.expect("checkpoint should be durable");
assert_eq!(last_version, 10);
// Stale deltas below the checkpoint version must still be present because
// the mocked deleter refused them.
let file_names: Vec<String> = manager
.store()
.delta_storage()
.get_paths(None, |e| Some(e.name().to_string()))
.await
.unwrap();
let stale_delta_count = file_names.iter().filter(|name| is_delta_file(name)).count();
assert!(
stale_delta_count > 0,
"expected leftover delta files after failed delete, got {:?}",
file_names,
);
// Recovery must succeed despite the leftover deltas.
manager.stop().await;
let reopened = env
.create_manifest_manager(CompressionType::Uncompressed, 1, None)
.await
.unwrap()
.expect("manifest should be recoverable");
assert_eq!(reopened.manifest().manifest_version, 10);
}
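
Note on the delete path this test exercises: `delete_until` tries one batch delete and, on failure, falls back to per-file deletes that stop at the first error, tracking success per path. The sketch below shows that control flow against a hypothetical `Store` trait. It is not the OpenDAL API, and `NoBatchStore` is an illustrative test double.

```rust
/// Hypothetical stand-in for an object store; not the OpenDAL API.
trait Store {
    fn delete_batch(&mut self, paths: &[String]) -> Result<(), String>;
    fn delete_one(&mut self, path: &str) -> Result<(), String>;
}

/// Tries a batch delete, then falls back to per-file deletes and stops at
/// the first failure. Returns one success flag per input path.
fn delete_with_fallback<S: Store>(store: &mut S, paths: &[String]) -> Vec<bool> {
    let mut succeeded = vec![false; paths.len()];
    match store.delete_batch(paths) {
        Ok(()) => succeeded.fill(true),
        Err(_) => {
            for (i, path) in paths.iter().enumerate() {
                if store.delete_one(path).is_err() {
                    // Remaining paths are left for a later retry.
                    break;
                }
                succeeded[i] = true;
            }
        }
    }
    succeeded
}

/// Test double: batch delete always fails; single deletes fail for one path.
struct NoBatchStore {
    poisoned: String,
    deleted: Vec<String>,
}

impl Store for NoBatchStore {
    fn delete_batch(&mut self, _paths: &[String]) -> Result<(), String> {
        Err("batch delete unsupported".to_string())
    }

    fn delete_one(&mut self, path: &str) -> Result<(), String> {
        if path == self.poisoned {
            return Err(format!("cannot delete {path}"));
        }
        self.deleted.push(path.to_string());
        Ok(())
    }
}
```

Only the flagged paths should be removed from size tracking, which is why the diff returns `deleted` rather than the total number of candidate paths.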

@@ -59,7 +59,6 @@ use crate::memtable::time_series::{ValueBuilder, Values};
use crate::memtable::{BoxedRecordBatchIterator, MemScanMetrics, MemtableStats};
use crate::sst::SeriesEstimator;
use crate::sst::index::IndexOutput;
use crate::sst::parquet::file_range::{PreFilterMode, row_group_contains_delete};
use crate::sst::parquet::flat_format::primary_key_column_index;
use crate::sst::parquet::format::{PrimaryKeyArray, PrimaryKeyArrayBuilder};
use crate::sst::parquet::{PARQUET_METADATA_KEY, SstInfo};
@@ -1028,9 +1027,8 @@ impl EncodedBulkPart {
sequence: Option<SequenceRange>,
mem_scan_metrics: Option<MemScanMetrics>,
) -> Result<Option<BoxedRecordBatchIterator>> {
// Compute skip_fields for row group pruning using the same approach as compute_skip_fields in reader.rs.
let skip_fields_for_pruning =
Self::compute_skip_fields(context.pre_filter_mode(), &self.metadata.parquet_metadata);
// Compute skip_fields for row group pruning from the configured pre-filter mode.
let skip_fields_for_pruning = context.pre_filter_mode().skip_fields();
// use predicate to find row groups to read.
let row_groups_to_read =
@@ -1050,20 +1048,6 @@ impl EncodedBulkPart {
)?;
Ok(Some(Box::new(iter) as BoxedRecordBatchIterator))
}
/// Computes whether to skip field columns based on PreFilterMode.
fn compute_skip_fields(pre_filter_mode: PreFilterMode, parquet_meta: &ParquetMetaData) -> bool {
match pre_filter_mode {
PreFilterMode::All => false,
PreFilterMode::SkipFields => true,
PreFilterMode::SkipFieldsOnDelete => {
// Check if any row group contains delete op
(0..parquet_meta.num_row_groups()).any(|rg_idx| {
row_group_contains_delete(parquet_meta, rg_idx, "memtable").unwrap_or(true)
})
}
}
}
}
// TODO(yingwen): max_sequence

@@ -29,7 +29,7 @@ use crate::memtable::bulk::part::EncodedBulkPart;
use crate::memtable::bulk::row_group_reader::MemtableRowGroupReaderBuilder;
use crate::memtable::{MemScanMetrics, MemScanMetricsData};
use crate::metrics::{READ_ROWS_TOTAL, READ_STAGE_ELAPSED};
use crate::sst::parquet::file_range::{PreFilterMode, TagDecodeState};
use crate::sst::parquet::file_range::TagDecodeState;
use crate::sst::parquet::flat_format::{primary_key_column_index, sequence_column_index};
use crate::sst::parquet::prefilter::{CachedPrimaryKeyFilter, prefilter_flat_batch_by_primary_key};
@@ -78,7 +78,7 @@ impl EncodedBulkPartIter {
let (init_reader, current_skip_fields) = match row_groups_to_read.pop_front() {
Some(first_row_group) => {
let skip_fields = builder.compute_skip_fields(&context, first_row_group);
let skip_fields = context.pre_filter_mode().skip_fields();
let reader = builder.build_row_group_reader(first_row_group, None)?;
(Some(reader), skip_fields)
}
@@ -140,9 +140,7 @@ impl EncodedBulkPartIter {
// Previous row group exhausted, read next row group
while let Some(next_row_group) = self.row_groups_to_read.pop_front() {
// Compute skip_fields for this row group
self.current_skip_fields = self
.builder
.compute_skip_fields(&self.context, next_row_group);
self.current_skip_fields = self.context.pre_filter_mode().skip_fields();
let next_reader = self.builder.build_row_group_reader(next_row_group, None)?;
let current = self.current_reader.insert(next_reader);
@@ -299,11 +297,7 @@ impl BulkPartBatchIter {
let projected_batch = self.apply_projection(record_batch)?;
// Apply combined filtering (both predicate and sequence filters)
let skip_fields = match self.context.pre_filter_mode() {
PreFilterMode::All => false,
PreFilterMode::SkipFields => true,
PreFilterMode::SkipFieldsOnDelete => true,
};
let skip_fields = self.context.pre_filter_mode().skip_fields();
let Some(filtered_batch) = apply_combined_filters(
&self.context,

@@ -31,7 +31,6 @@ use crate::sst::parquet::DEFAULT_READ_BATCH_SIZE;
pub(crate) struct MemtableRowGroupReaderBuilder {
projection: ProjectionMask,
parquet_metadata: Arc<ParquetMetaData>,
arrow_metadata: ArrowReaderMetadata,
data: Bytes,
}
@@ -51,7 +50,6 @@ impl MemtableRowGroupReaderBuilder {
.context(ReadDataPartSnafu)?;
Ok(Self {
projection,
parquet_metadata,
arrow_metadata,
data,
})
@@ -79,23 +77,4 @@ impl MemtableRowGroupReaderBuilder {
builder.build().context(ReadDataPartSnafu)
}
/// Computes whether to skip field filters for a specific row group based on PreFilterMode.
pub(crate) fn compute_skip_fields(
&self,
context: &BulkIterContextRef,
row_group_idx: usize,
) -> bool {
use crate::sst::parquet::file_range::{PreFilterMode, row_group_contains_delete};
match context.pre_filter_mode() {
PreFilterMode::All => false,
PreFilterMode::SkipFields => true,
PreFilterMode::SkipFieldsOnDelete => {
// Check if this specific row group contains delete op
row_group_contains_delete(&self.parquet_metadata, row_group_idx, "memtable")
.unwrap_or(true)
}
}
}
}
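
Note on `pre_filter_mode().skip_fields()`: the three memtable call sites above now share one method instead of local `match` blocks, and the per-row-group delete check is gone from this path. The definition of `skip_fields` is not part of this diff, so the sketch below is an assumption. It mirrors the `match` removed from `BulkPartBatchIter` and assumes the enum keeps all three variants.

```rust
/// Local stand-in for `crate::sst::parquet::file_range::PreFilterMode`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PreFilterMode {
    All,
    SkipFields,
    SkipFieldsOnDelete,
}

impl PreFilterMode {
    /// Assumed mapping, copied from the match removed in `BulkPartBatchIter`;
    /// the real method may differ.
    fn skip_fields(self) -> bool {
        match self {
            PreFilterMode::All => false,
            PreFilterMode::SkipFields | PreFilterMode::SkipFieldsOnDelete => true,
        }
    }
}
```

Under this assumption the decision depends only on the mode, so callers no longer need Parquet metadata to compute it.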

@@ -32,7 +32,7 @@ use crate::metrics::PRUNER_ACTIVE_BUILDERS;
use crate::read::range::{FileRangeBuilder, RowGroupIndex};
use crate::read::scan_region::StreamContext;
use crate::read::scan_util::{FileScanMetrics, PartitionMetrics};
use crate::sst::parquet::file_range::FileRange;
use crate::sst::parquet::file_range::{FileRange, PreFilterMode};
use crate::sst::parquet::reader::ReaderMetrics;
/// Number of files to pre-fetch ahead of the current position.
@@ -43,6 +43,8 @@ pub struct PartitionPruner {
pruner: Arc<Pruner>,
/// Files to prune, in the order to scan.
file_indices: Vec<usize>,
/// Per-file pre-filter mode lookup indexed by file_index.
pre_filter_modes: Vec<PreFilterMode>,
/// Current position for tracking pre-fetch progress.
current_position: AtomicUsize,
}
@@ -50,12 +52,15 @@ pub struct PartitionPruner {
impl PartitionPruner {
/// Creates a new `PartitionPruner` for the given partition ranges.
pub fn new(pruner: Arc<Pruner>, partition_ranges: &[PartitionRange]) -> Self {
let mut file_indices = Vec::with_capacity(pruner.inner.stream_ctx.input.num_files());
let num_files = pruner.inner.stream_ctx.input.num_files();
let mut file_indices = Vec::with_capacity(num_files);
let mut pre_filter_modes = vec![PreFilterMode::SkipFields; num_files];
let mut dedup_set = HashSet::with_capacity(pruner.inner.stream_ctx.input.num_files());
let num_memtables = pruner.inner.stream_ctx.input.num_memtables();
for part_range in partition_ranges {
let range_meta = &pruner.inner.stream_ctx.ranges[part_range.identifier];
let pre_filter_mode = pruner.inner.stream_ctx.range_pre_filter_mode(part_range);
for row_group_index in &range_meta.row_group_indices {
if pruner
.inner
@@ -67,6 +72,7 @@ impl PartitionPruner {
continue;
} else {
file_indices.push(file_index);
pre_filter_modes[file_index] = pre_filter_mode;
dedup_set.insert(file_index);
}
}
@@ -76,6 +82,7 @@ impl PartitionPruner {
Self {
pruner,
file_indices,
pre_filter_modes,
current_position: AtomicUsize::new(0),
}
}
@@ -91,11 +98,12 @@ impl PartitionPruner {
reader_metrics: &mut ReaderMetrics,
) -> Result<SmallVec<[FileRange; 2]>> {
let file_index = index.index - self.pruner.inner.stream_ctx.input.num_memtables();
let pre_filter_mode = self.pre_filter_mode(file_index);
// Delegate to underlying Pruner
let ranges = self
.pruner
.build_file_ranges(index, partition_metrics, reader_metrics)
.build_file_ranges(index, pre_filter_mode, partition_metrics, reader_metrics)
.await?;
// Find position and trigger pre-fetch for upcoming files
@@ -115,10 +123,22 @@ impl PartitionPruner {
let end = (start + PREFETCH_COUNT).min(self.file_indices.len());
for i in start..end {
self.pruner
.get_file_builder_background(self.file_indices[i], Some(partition_metrics.clone()));
let file_index = self.file_indices[i];
let pre_filter_mode = self.pre_filter_mode(file_index);
self.pruner.get_file_builder_background(
file_index,
pre_filter_mode,
Some(partition_metrics.clone()),
);
}
}
fn pre_filter_mode(&self, file_index: usize) -> PreFilterMode {
self.pre_filter_modes
.get(file_index)
.copied()
.unwrap_or(PreFilterMode::SkipFields)
}
}
/// A pruner that prunes files for all partitions of a scanner.
@@ -152,6 +172,8 @@ struct FileBuilderEntry {
struct PruneRequest {
/// Index of the file in ScanInput.files.
file_index: usize,
/// Pre-filter mode to use for the file.
pre_filter_mode: PreFilterMode,
/// Oneshot channel to send back the result.
response_tx: Option<oneshot::Sender<Result<Arc<FileRangeBuilder>>>>,
/// Partition metrics for merging reader metrics.
@@ -174,7 +196,6 @@ impl Pruner {
})
})
.collect();
// Create channels and collect senders
let mut worker_senders = Vec::with_capacity(num_workers);
let mut receivers = Vec::with_capacity(num_workers);
@@ -230,6 +251,7 @@ impl Pruner {
pub async fn build_file_ranges(
&self,
index: RowGroupIndex,
pre_filter_mode: PreFilterMode,
partition_metrics: &PartitionMetrics,
reader_metrics: &mut ReaderMetrics,
) -> Result<SmallVec<[FileRange; 2]>> {
@@ -237,7 +259,12 @@ impl Pruner {
// Get builder (from cache or by pruning)
let builder = self
.get_file_builder(file_index, partition_metrics, reader_metrics)
.get_file_builder(
file_index,
pre_filter_mode,
partition_metrics,
reader_metrics,
)
.await?;
// Build ranges
@@ -254,6 +281,7 @@ impl Pruner {
async fn get_file_builder(
&self,
file_index: usize,
pre_filter_mode: PreFilterMode,
partition_metrics: &PartitionMetrics,
reader_metrics: &mut ReaderMetrics,
) -> Result<Arc<FileRangeBuilder>> {
@@ -275,6 +303,7 @@ impl Pruner {
let (response_tx, response_rx) = oneshot::channel();
let request = PruneRequest {
file_index,
pre_filter_mode,
response_tx: Some(response_tx),
partition_metrics: Some(partition_metrics.clone()),
};
@@ -282,7 +311,8 @@ impl Pruner {
let result = if self.worker_senders[worker_idx].send(request).await.is_err() {
common_telemetry::warn!("Worker channel closed, falling back to direct pruning");
// Worker channel closed, falls back to direct pruning
self.prune_file_directly(file_index, reader_metrics).await
self.prune_file_directly(file_index, pre_filter_mode, reader_metrics)
.await
} else {
// Waits for response
match response_rx.await {
@@ -292,7 +322,8 @@ impl Pruner {
"Response channel closed, falling back to direct pruning"
);
// Channel closed, falls back to direct pruning
self.prune_file_directly(file_index, reader_metrics).await
self.prune_file_directly(file_index, pre_filter_mode, reader_metrics)
.await
}
}
};
@@ -304,6 +335,7 @@ impl Pruner {
pub fn get_file_builder_background(
&self,
file_index: usize,
pre_filter_mode: PreFilterMode,
partition_metrics: Option<PartitionMetrics>,
) {
// Fast path: checks cache
@@ -320,6 +352,7 @@ impl Pruner {
let request = PruneRequest {
file_index,
pre_filter_mode,
response_tx: None,
partition_metrics,
};
@@ -338,6 +371,7 @@ impl Pruner {
async fn prune_file_directly(
&self,
file_index: usize,
pre_filter_mode: PreFilterMode,
reader_metrics: &mut ReaderMetrics,
) -> Result<Arc<FileRangeBuilder>> {
let file = &self.inner.stream_ctx.input.files[file_index];
@@ -345,7 +379,7 @@ impl Pruner {
.inner
.stream_ctx
.input
.prune_file(file, reader_metrics)
.prune_file(file, pre_filter_mode, reader_metrics)
.await?;
let arc_builder = Arc::new(builder);
@@ -391,6 +425,7 @@ impl Pruner {
while let Some(request) = rx.recv().await {
let PruneRequest {
file_index,
pre_filter_mode,
response_tx,
partition_metrics,
} = request;
@@ -398,7 +433,6 @@ impl Pruner {
// Check if already cached or in-progress
{
let entry = inner.file_entries[file_index].lock().unwrap();
if let Some(builder) = &entry.builder {
// Cache hit - send immediately
if let Some(response_tx) = response_tx {
@@ -414,7 +448,11 @@ impl Pruner {
let file = &inner.stream_ctx.input.files[file_index];
pruned_files.push(file.file_id().file_id());
let mut metrics = ReaderMetrics::default();
let result = inner.stream_ctx.input.prune_file(file, &mut metrics).await;
let result = inner
.stream_ctx
.input
.prune_file(file, pre_filter_mode, &mut metrics)
.await;
// Update state and notify waiters
let mut entry = inner.file_entries[file_index].lock().unwrap();


@@ -1101,10 +1101,10 @@ impl ScanInput {
pub async fn prune_file(
&self,
file: &FileHandle,
pre_filter_mode: PreFilterMode,
reader_metrics: &mut ReaderMetrics,
) -> Result<FileRangeBuilder> {
let predicate = self.predicate_for_file(file);
let filter_mode = pre_filter_mode(self.append_mode, self.merge_mode);
let decode_pk_values = !self.compaction && self.mapper.has_tags();
let reader = self
.access_layer
@@ -1125,7 +1125,7 @@ impl ScanInput {
let res = reader
.expected_metadata(Some(self.mapper.metadata().clone()))
.compaction(self.compaction)
.pre_filter_mode(filter_mode)
.pre_filter_mode(pre_filter_mode)
.decode_primary_key_values(decode_pk_values)
.build_reader_input(reader_metrics)
.await;
@@ -1330,6 +1330,18 @@ impl ScanInput {
pub fn region_metadata(&self) -> &RegionMetadataRef {
self.mapper.metadata()
}
fn range_pre_filter_mode(&self, source_count: usize) -> PreFilterMode {
if source_count <= 1 {
// Duplicated rows in the same source are not a normal case, and we don't provide
// strict dedup semantics (last_row/last_non_null) for them. We expect duplicated rows
// in the same source to be exactly identical, so we use PreFilterMode::All for
// performance reasons.
return PreFilterMode::All;
}
pre_filter_mode(self.append_mode, self.merge_mode)
}
}
#[cfg(feature = "enterprise")]
@@ -1372,7 +1384,7 @@ fn pre_filter_mode(append_mode: bool, merge_mode: MergeMode) -> PreFilterMode {
}
match merge_mode {
MergeMode::LastRow => PreFilterMode::SkipFieldsOnDelete,
MergeMode::LastRow => PreFilterMode::SkipFields,
MergeMode::LastNonNull => PreFilterMode::SkipFields,
}
}
@@ -1533,6 +1545,13 @@ impl StreamContext {
&& index.index < self.input.num_files() + self.input.num_memtables()
}
pub(crate) fn range_pre_filter_mode(&self, part_range: &PartitionRange) -> PreFilterMode {
let range_meta = &self.ranges[part_range.identifier];
let source_count = range_meta.indices.len();
self.input.range_pre_filter_mode(source_count)
}
/// Retrieves the partition ranges.
pub(crate) fn partition_ranges(&self) -> Vec<PartitionRange> {
self.ranges
@@ -2095,4 +2114,24 @@ mod tests {
assert!(predicate_without_region.exprs().is_empty());
assert_eq!(1, predicate_without_region.dyn_filters().len());
}
#[tokio::test]
async fn test_range_pre_filter_mode() {
let metadata = Arc::new(metadata_with_primary_key(vec![0, 1], false));
let cases = [
(true, MergeMode::LastRow, 1, PreFilterMode::All),
(false, MergeMode::LastNonNull, 1, PreFilterMode::All),
(false, MergeMode::LastRow, 2, PreFilterMode::SkipFields),
(true, MergeMode::LastRow, 2, PreFilterMode::All),
];
for (append_mode, merge_mode, source_count, expected_mode) in cases {
let input = new_scan_input(metadata.clone(), vec![])
.await
.with_append_mode(append_mode)
.with_merge_mode(merge_mode);
assert_eq!(expected_mode, input.range_pre_filter_mode(source_count));
}
}
}
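
The mode selection exercised by `test_range_pre_filter_mode` can be sketched in isolation. This is a minimal sketch, not the crate's code: the enums are simplified stand-ins, and the `append_mode` early return is an assumption inferred from the test case `(true, MergeMode::LastRow, 2, PreFilterMode::All)`, since the hunk does not show the top of `pre_filter_mode`.

```rust
/// Simplified stand-in for the crate's merge mode (assumption: only these two variants matter here).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum MergeMode {
    LastRow,
    LastNonNull,
}

/// Stand-in for `PreFilterMode` after `SkipFieldsOnDelete` was removed.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PreFilterMode {
    /// Filters all columns.
    All,
    /// Always skip fields.
    SkipFields,
}

impl PreFilterMode {
    /// Returns true when field filters must not be applied before dedup.
    pub fn skip_fields(self) -> bool {
        matches!(self, Self::SkipFields)
    }
}

/// Mode for a range that merges several sources.
fn pre_filter_mode(append_mode: bool, merge_mode: MergeMode) -> PreFilterMode {
    if append_mode {
        // Assumption inferred from the test table: append-only data is never
        // deduplicated, so every filter can run early.
        return PreFilterMode::All;
    }
    match merge_mode {
        MergeMode::LastRow => PreFilterMode::SkipFields,
        MergeMode::LastNonNull => PreFilterMode::SkipFields,
    }
}

/// Mode for a partition range, given how many sources feed it.
pub fn range_pre_filter_mode(
    append_mode: bool,
    merge_mode: MergeMode,
    source_count: usize,
) -> PreFilterMode {
    if source_count <= 1 {
        // A single source is expected to hold no conflicting duplicates.
        return PreFilterMode::All;
    }
    pre_filter_mode(append_mode, merge_mode)
}
```

With these definitions, a range with two sources in `LastRow` mode skips field filters, while any single-source range filters all columns.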


@@ -208,12 +208,18 @@ impl SeqScan {
}
let mapper = stream_ctx.input.mapper.as_flat().unwrap();
let schema = mapper.input_arrow_schema(stream_ctx.input.compaction);
let metrics_reporter = part_metrics.map(|m| m.merge_metrics_reporter());
let reader =
FlatMergeReader::new(schema, sources, DEFAULT_READ_BATCH_SIZE, metrics_reporter)
.await?;
let reader: BoxedRecordBatchStream = if sources.len() == 1 {
// Currently, we can't skip dedup when there is only one source because
// that source may have duplicate rows.
sources.pop().unwrap()
} else {
let schema = mapper.input_arrow_schema(stream_ctx.input.compaction);
let metrics_reporter = part_metrics.map(|m| m.merge_metrics_reporter());
let reader =
FlatMergeReader::new(schema, sources, DEFAULT_READ_BATCH_SIZE, metrics_reporter)
.await?;
Box::pin(reader.into_stream())
};
let dedup = !stream_ctx.input.append_mode;
let dedup_metrics_reporter = part_metrics.map(|m| m.dedup_metrics_reporter());
@@ -221,7 +227,7 @@ impl SeqScan {
match stream_ctx.input.merge_mode {
MergeMode::LastRow => Box::pin(
FlatDedupReader::new(
reader.into_stream().boxed(),
reader,
FlatLastRow::new(stream_ctx.input.filter_deleted),
dedup_metrics_reporter,
)
@@ -229,7 +235,7 @@ impl SeqScan {
) as _,
MergeMode::LastNonNull => Box::pin(
FlatDedupReader::new(
reader.into_stream().boxed(),
reader,
FlatLastNonNull::new(
mapper.field_column_start(),
stream_ctx.input.filter_deleted,
@@ -240,7 +246,7 @@ impl SeqScan {
) as _,
}
} else {
Box::pin(reader.into_stream()) as _
reader
};
let reader = match &stream_ctx.input.series_row_selector {
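
The control flow of this hunk (bypass the merge reader for a single source, but still run dedup unless the region is append-only) might be easier to follow on plain vectors. The sketch below is illustrative only: `Row`, `merge`, `dedup_last_row`, and `scan` are hypothetical names, and each source is assumed to be sorted by key with newer sequences first.

```rust
/// Hypothetical row: (primary key, sequence, value).
pub type Row = (u32, u64, &'static str);

/// Stand-in for the merge reader: orders rows by key, newest sequence first.
fn merge(sources: Vec<Vec<Row>>) -> Vec<Row> {
    let mut rows: Vec<Row> = sources.into_iter().flatten().collect();
    rows.sort_by(|a, b| a.0.cmp(&b.0).then(b.1.cmp(&a.1)));
    rows
}

/// Stand-in for last-row dedup: keeps the first (newest) row of each key.
fn dedup_last_row(rows: Vec<Row>) -> Vec<Row> {
    let mut out: Vec<Row> = Vec::new();
    for row in rows {
        if out.last().map(|last| last.0) != Some(row.0) {
            out.push(row);
        }
    }
    out
}

/// Skips the merge step for one source, but never skips dedup for it,
/// because a single source may still contain duplicate keys.
pub fn scan(mut sources: Vec<Vec<Row>>, append_mode: bool) -> Vec<Row> {
    let rows = if sources.len() == 1 {
        sources.pop().unwrap()
    } else {
        merge(sources)
    };
    if append_mode {
        rows
    } else {
        dedup_last_row(rows)
    }
}
```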


@@ -156,11 +156,6 @@ pub struct MitoRegion {
pub(crate) topic_latest_entry_id: AtomicU64,
/// The total bytes written to the region.
pub(crate) written_bytes: Arc<AtomicU64>,
/// Partition info of the region in staging mode.
///
/// During the staging mode, the region metadata in [`VersionControlRef`] is not updated,
/// so we need to store the partition info separately.
pub(crate) staging_partition_info: Mutex<Option<StagingPartitionInfo>>,
/// manifest stats
stats: ManifestStats,
}
@@ -333,6 +328,17 @@ impl MitoRegion {
self.manifest_ctx.set_role(next_role, self.region_id);
}
pub(crate) fn region_role(&self) -> RegionRole {
match self.state() {
RegionRoleState::Follower => RegionRole::Follower,
RegionRoleState::Leader(RegionLeaderState::Staging) => RegionRole::StagingLeader,
RegionRoleState::Leader(RegionLeaderState::Downgrading) => {
RegionRole::DowngradingLeader
}
RegionRoleState::Leader(_) => RegionRole::Leader,
}
}
/// Sets the altering state.
/// You should call this method in the worker loop.
pub(crate) fn set_altering(&self) -> Result<()> {
@@ -393,9 +399,8 @@ impl MitoRegion {
/// You should call this method in the worker loop.
/// Transitions from Staging to Writable state.
pub fn exit_staging(&self) -> Result<()> {
*self.staging_partition_info.lock().unwrap() = None;
self.compare_exchange_state(
RegionLeaderState::Staging,
self.manifest_ctx.exit_staging(
self.region_id,
RegionRoleState::Leader(RegionLeaderState::Writable),
)
}
@@ -819,7 +824,7 @@ impl MitoRegion {
pub fn maybe_staging_partition_expr_str(&self) -> Option<String> {
let is_staging = self.is_staging();
if is_staging {
let staging_partition_info = self.staging_partition_info.lock().unwrap();
let staging_partition_info = self.manifest_ctx.staging_partition_info();
if staging_partition_info.is_none() {
warn!(
"Staging partition expr is none for region {} in staging state",
@@ -837,8 +842,8 @@ impl MitoRegion {
pub fn expected_partition_expr_version(&self) -> u64 {
if self.is_staging() {
let staging_partition_info = self.staging_partition_info.lock().unwrap();
staging_partition_info
self.manifest_ctx
.staging_partition_info()
.as_ref()
.map(|info| info.partition_rule_version)
.unwrap_or_default()
@@ -852,8 +857,8 @@ impl MitoRegion {
if !self.is_staging() {
return false;
}
let staging_partition_info = self.staging_partition_info.lock().unwrap();
staging_partition_info
self.manifest_ctx
.staging_partition_info()
.as_ref()
.map(|info| {
matches!(
@@ -873,6 +878,11 @@ pub(crate) struct ManifestContext {
/// The state of the region. The region checks the state before updating
/// manifest.
state: AtomicCell<RegionRoleState>,
/// Partition info of the region in staging mode.
///
/// During the staging mode, the region metadata in [`VersionControlRef`] is not updated,
/// so we need to store the partition info separately.
staging_partition_info: Mutex<Option<StagingPartitionInfo>>,
}
impl ManifestContext {
@@ -880,9 +890,46 @@ impl ManifestContext {
ManifestContext {
manifest_manager: tokio::sync::RwLock::new(manager),
state: AtomicCell::new(state),
staging_partition_info: Mutex::new(None),
}
}
pub(crate) fn staging_partition_info(&self) -> Option<StagingPartitionInfo> {
self.staging_partition_info.lock().unwrap().clone()
}
pub(crate) fn set_staging_partition_info(&self, staging_partition_info: StagingPartitionInfo) {
let mut current = self.staging_partition_info.lock().unwrap();
debug_assert!(current.is_none());
*current = Some(staging_partition_info);
}
fn clear_staging_partition_info(&self) {
*self.staging_partition_info.lock().unwrap() = None;
}
pub(crate) fn exit_staging(
&self,
region_id: RegionId,
next_state: RegionRoleState,
) -> Result<()> {
self.state
.compare_exchange(
RegionRoleState::Leader(RegionLeaderState::Staging),
next_state,
)
.map_err(|actual| {
RegionStateSnafu {
region_id,
state: actual,
expect: RegionRoleState::Leader(RegionLeaderState::Staging),
}
.build()
})?;
self.clear_staging_partition_info();
Ok(())
}
pub(crate) async fn manifest_version(&self) -> ManifestVersion {
self.manifest_manager
.read()
@@ -1028,27 +1075,50 @@ impl ManifestContext {
/// Sets the [`RegionRole`].
///
/// ```text
/// +------------------------------------------+
/// | +-----------------+ |
/// | | | |
/// +---+------+ +-------+-----+ +--v-v---+
/// | Follower | | Downgrading | | Leader |
/// +---^-^----+ +-----+-^-----+ +--+-+---+
/// | | | | | |
/// | +------------------+ +-----------------+ |
/// +------------------------------------------+
///
/// Transition:
/// - Follower -> Leader
/// - Downgrading Leader -> Leader
/// - Leader -> Follower
/// - Downgrading Leader -> Follower
/// - Leader -> Downgrading Leader
/// +---------------------+
/// | Staging Leader |
/// +----------+----------+
/// |
/// v
/// +----------+ +------+-------+ +-------------+
/// | Follower | <-> | Leader | <-> | Downgrading |
/// +-----+----+ +------+-------+ +------+------+
/// ^ ^ |
/// +-----------------+--------------------+
///
/// ```
///
/// # State Transitions
///
/// From `Follower`:
/// - `Follower -> Leader`
///
/// From `Leader`:
/// - `Leader -> Follower`
/// - `Leader -> Downgrading Leader`
///
/// From `Staging Leader`:
/// - `Staging Leader -> Leader`
/// - `Staging Leader -> Follower`
/// - `Staging Leader -> Downgrading Leader`
///
/// From `Downgrading Leader`:
/// - `Downgrading Leader -> Leader`
/// - `Downgrading Leader -> Follower`
pub(crate) fn set_role(&self, next_role: RegionRole, region_id: RegionId) {
match next_role {
RegionRole::Follower => {
if self
.exit_staging(region_id, RegionRoleState::Follower)
.is_ok()
{
info!(
"Convert region {} to follower, previous role state: {:?}",
region_id,
RegionRoleState::Leader(RegionLeaderState::Staging)
);
return;
}
match self.state.fetch_update(|state| {
if !matches!(state, RegionRoleState::Follower) {
Some(RegionRoleState::Follower)
@@ -1071,6 +1141,20 @@ impl ManifestContext {
}
}
RegionRole::Leader => {
if self
.exit_staging(
region_id,
RegionRoleState::Leader(RegionLeaderState::Writable),
)
.is_ok()
{
info!(
"Convert region {} to leader, previous role state: {:?}",
region_id,
RegionRoleState::Leader(RegionLeaderState::Staging)
);
return;
}
match self.state.fetch_update(|state| {
if matches!(
state,
@@ -1096,7 +1180,27 @@ impl ManifestContext {
}
}
}
RegionRole::StagingLeader => {
info!(
"Ignore direct conversion of region {} to staging leader; staging requires the dedicated workflow",
region_id
);
}
RegionRole::DowngradingLeader => {
if self
.exit_staging(
region_id,
RegionRoleState::Leader(RegionLeaderState::Downgrading),
)
.is_ok()
{
info!(
"Convert region {} to downgrading region, previous role state: {:?}",
region_id,
RegionRoleState::Leader(RegionLeaderState::Staging)
);
return;
}
match self.state.compare_exchange(
RegionRoleState::Leader(RegionLeaderState::Writable),
RegionRoleState::Leader(RegionLeaderState::Downgrading),
@@ -1438,8 +1542,8 @@ pub fn parse_partition_expr(partition_expr_str: Option<&str>) -> Result<Option<P
#[cfg(test)]
mod tests {
use std::sync::Arc;
use std::sync::atomic::AtomicU64;
use std::sync::{Arc, Mutex};
use common_datasource::compression::CompressionType;
use common_test_util::temp_dir::create_temp_dir;
@@ -1512,7 +1616,6 @@ mod tests {
topic_latest_entry_id: Default::default(),
written_bytes: Arc::new(AtomicU64::new(0)),
stats: ManifestStats::default(),
staging_partition_info: Mutex::new(None),
}
}
@@ -1684,6 +1787,13 @@ mod tests {
RegionRoleState::Leader(RegionLeaderState::Writable)
);
// Direct Leader -> StagingLeader should be ignored.
manifest_ctx.set_role(RegionRole::StagingLeader, region_id);
assert_eq!(
manifest_ctx.state.load(),
RegionRoleState::Leader(RegionLeaderState::Writable)
);
// Leader -> Downgrading Leader
manifest_ctx.set_role(RegionRole::DowngradingLeader, region_id);
assert_eq!(
@@ -1825,7 +1935,6 @@ mod tests {
topic_latest_entry_id: Default::default(),
written_bytes: Arc::new(AtomicU64::new(0)),
stats: ManifestStats::default(),
staging_partition_info: Mutex::new(None),
};
// Test initial state
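
The ordering used by `ManifestContext::exit_staging` (compare-exchange the state first, clear the staging info only on success) can be sketched with the standard library alone. This is a hedged sketch: `ManifestCtxSketch`, the `u8` state constants, and the `String` payload are hypothetical stand-ins for `AtomicCell<RegionRoleState>` and `StagingPartitionInfo`.

```rust
use std::sync::atomic::{AtomicU8, Ordering};
use std::sync::Mutex;

// Hypothetical numeric encoding of the region role states.
pub const FOLLOWER: u8 = 0;
pub const WRITABLE: u8 = 1;
pub const STAGING: u8 = 2;
pub const DOWNGRADING: u8 = 3;

/// Stand-in for the manifest context: a state cell plus staging-only data.
pub struct ManifestCtxSketch {
    state: AtomicU8,
    staging_info: Mutex<Option<String>>,
}

impl ManifestCtxSketch {
    pub fn new(state: u8) -> Self {
        Self {
            state: AtomicU8::new(state),
            staging_info: Mutex::new(None),
        }
    }

    pub fn state(&self) -> u8 {
        self.state.load(Ordering::SeqCst)
    }

    pub fn staging_info(&self) -> Option<String> {
        self.staging_info.lock().unwrap().clone()
    }

    pub fn set_staging_info(&self, info: String) {
        *self.staging_info.lock().unwrap() = Some(info);
    }

    /// Leaves staging for `next`. Returns `Err(actual_state)` and changes
    /// nothing when the region is not currently staging.
    pub fn exit_staging(&self, next: u8) -> Result<(), u8> {
        self.state
            .compare_exchange(STAGING, next, Ordering::SeqCst, Ordering::SeqCst)?;
        // Only reached when the transition succeeded.
        *self.staging_info.lock().unwrap() = None;
        Ok(())
    }
}
```

Because the staging info is cleared only after the state swap succeeds, a failed attempt from a non-staging state leaves both the state and the info untouched, which is what lets `set_role` try `exit_staging` first and fall through to the regular transition.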


@@ -17,7 +17,7 @@
use std::any::TypeId;
use std::collections::HashMap;
use std::sync::atomic::{AtomicI64, AtomicU64};
use std::sync::{Arc, LazyLock, Mutex};
use std::sync::{Arc, LazyLock};
use std::time::Instant;
use common_telemetry::{debug, error, info, warn};
@@ -349,7 +349,6 @@ impl RegionOpener {
topic_latest_entry_id: AtomicU64::new(0),
written_bytes: Arc::new(AtomicU64::new(0)),
stats: self.stats,
staging_partition_info: Mutex::new(None),
}))
}
@@ -586,8 +585,6 @@ impl RegionOpener {
topic_latest_entry_id: AtomicU64::new(topic_latest_entry_id),
written_bytes: Arc::new(AtomicU64::new(0)),
stats: self.stats.clone(),
// TODO(weny): reload the staging partition info from the manifest.
staging_partition_info: Mutex::new(None),
};
let region = Arc::new(region);


@@ -46,9 +46,9 @@ use store_api::storage::{FileId, RegionId};
use tokio::sync::oneshot::{self, Receiver, Sender};
use crate::error::{
CompactRegionSnafu, ConvertColumnDataTypeSnafu, CreateDefaultSnafu, Error, FillDefaultSnafu,
FlushRegionSnafu, InvalidPartitionExprSnafu, InvalidRequestSnafu, MissingPartitionExprSnafu,
Result, UnexpectedSnafu,
CompactRegionSnafu, CompactionCancelledSnafu, ConvertColumnDataTypeSnafu, CreateDefaultSnafu,
Error, FillDefaultSnafu, FlushRegionSnafu, InvalidPartitionExprSnafu, InvalidRequestSnafu,
MissingPartitionExprSnafu, Result, UnexpectedSnafu,
};
use crate::flush::FlushReason;
use crate::manifest::action::{RegionEdit, TruncateKind};
@@ -895,6 +895,8 @@ pub(crate) enum BackgroundNotify {
IndexBuildFailed(IndexBuildFailed),
/// Compaction has finished.
CompactionFinished(CompactionFinished),
/// Compaction has been cancelled cooperatively.
CompactionCancelled(CompactionCancelled),
/// Compaction has failed.
CompactionFailed(CompactionFailed),
/// Truncate result.
@@ -991,6 +993,24 @@ pub(crate) struct CompactionFinished {
pub(crate) edit: RegionEdit,
}
/// Notifies a compaction job has been cancelled cooperatively.
#[derive(Debug)]
pub(crate) struct CompactionCancelled {
/// Region id.
pub(crate) region_id: RegionId,
/// Waiters to wake once the cancellation has been observed by the worker.
pub(crate) senders: Vec<OutputTx>,
}
impl CompactionCancelled {
pub(crate) fn on_success(self) {
for sender in self.senders {
sender.send(CompactionCancelledSnafu {}.fail());
}
info!("Compaction cancelled for region: {}", self.region_id);
}
}
impl CompactionFinished {
pub fn on_success(self) {
// only update compaction time on success
@@ -1149,10 +1169,13 @@ pub(crate) struct CopyRegionFromRequest {
mod tests {
use api::v1::value::ValueData;
use api::v1::{Row, SemanticType};
use common_error::ext::ErrorExt;
use common_error::status_code::StatusCode;
use datatypes::prelude::ConcreteDataType;
use datatypes::schema::ColumnDefaultConstraint;
use mito_codec::test_util::i64_value;
use store_api::metadata::RegionMetadataBuilder;
use tokio::sync::oneshot;
use super::*;
use crate::error::Error;
@@ -1216,6 +1239,21 @@ mod tests {
assert_eq!(None, request.column_index_by_name("c2"));
}
#[test]
fn test_compaction_cancelled_sends_cancelled_error() {
let (tx, rx) = oneshot::channel();
let request = CompactionCancelled {
region_id: RegionId::new(1, 1),
senders: vec![OutputTx::new(tx)],
};
request.on_success();
let err = rx.blocking_recv().unwrap().unwrap_err();
assert!(matches!(err, Error::CompactionCancelled { .. }));
assert_eq!(err.status_code(), StatusCode::Cancelled);
}
#[test]
fn test_write_request_column_num() {
let rows = Rows {
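
The behaviour checked by `test_compaction_cancelled_sends_cancelled_error` is that every waiter receives a cancellation error rather than a success value. A minimal sketch with standard-library channels follows; `SketchError`, `CompactionCancelledSketch`, and `notify` are hypothetical names, and `std::sync::mpsc` stands in for the crate's `OutputTx` wrapper around a oneshot sender.

```rust
use std::sync::mpsc::Sender;

/// Hypothetical error type with a single cancellation variant.
#[derive(Debug, PartialEq)]
pub enum SketchError {
    CompactionCancelled,
}

/// Stand-in for the `CompactionCancelled` notification.
pub struct CompactionCancelledSketch {
    pub region_id: u64,
    /// Waiters to wake once the cancellation has been observed.
    pub senders: Vec<Sender<Result<usize, SketchError>>>,
}

impl CompactionCancelledSketch {
    /// Wakes every waiter with a cancellation error.
    pub fn notify(self) {
        for sender in self.senders {
            // A waiter that already went away is not an error here.
            let _ = sender.send(Err(SketchError::CompactionCancelled));
        }
    }
}
```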


@@ -146,7 +146,7 @@ impl FileRange {
std::slice::from_ref(curr_row_group),
read_format,
self.context.base.expected_metadata.clone(),
self.compute_skip_fields(),
self.context.base.pre_filter_mode.skip_fields(),
);
// not costly to create a predicate here since dynamic filters are wrapped in Arc
@@ -158,22 +158,6 @@ impl FileRange {
.unwrap_or(true) // unexpected, not skip just in case
}
fn compute_skip_fields(&self) -> bool {
match self.context.base.pre_filter_mode {
PreFilterMode::All => false,
PreFilterMode::SkipFields => true,
PreFilterMode::SkipFieldsOnDelete => {
// Check if this specific row group contains delete op
row_group_contains_delete(
self.context.reader_builder.parquet_metadata(),
self.row_group_idx,
self.context.reader_builder.file_path(),
)
.unwrap_or(true)
}
}
}
/// Returns a reader to read the [FileRange].
#[allow(dead_code)]
pub(crate) async fn reader(
@@ -185,7 +169,7 @@ impl FileRange {
return Ok(None);
}
// Compute skip_fields once for this row group
let skip_fields = self.context.should_skip_fields(self.row_group_idx);
let skip_fields = self.context.base.pre_filter_mode.skip_fields();
let parquet_reader = self
.context
.reader_builder
@@ -247,7 +231,7 @@ impl FileRange {
return Ok(None);
}
// Compute skip_fields once for this row group
let skip_fields = self.context.should_skip_fields(self.row_group_idx);
let skip_fields = self.context.base.pre_filter_mode.skip_fields();
let parquet_reader = self
.context
.reader_builder
@@ -404,16 +388,8 @@ impl FileRangeContext {
)
}
/// Determines whether to skip field filters based on PreFilterMode and row group delete status.
pub(crate) fn should_skip_fields(&self, row_group_idx: usize) -> bool {
match self.base.pre_filter_mode {
PreFilterMode::All => false,
PreFilterMode::SkipFields => true,
PreFilterMode::SkipFieldsOnDelete => {
// Check if this specific row group contains delete op
self.contains_delete(row_group_idx).unwrap_or(true)
}
}
pub(crate) fn pre_filter_mode(&self) -> PreFilterMode {
self.base.pre_filter_mode
}
/// Decodes parquet metadata and checks whether the row group contains a delete op.
@@ -447,17 +423,20 @@ impl FileRangeContext {
}
/// Mode to pre-filter columns in a range.
#[derive(Debug, Clone, Copy)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PreFilterMode {
/// Filters all columns.
All,
/// If the range doesn't contain delete op or doesn't have statistics, filters all columns.
/// Otherwise, skips filtering fields.
SkipFieldsOnDelete,
/// Always skip fields.
SkipFields,
}
impl PreFilterMode {
pub(crate) fn skip_fields(self) -> bool {
matches!(self, Self::SkipFields)
}
}
/// Context for partition expression filtering.
pub(crate) struct PartitionFilterContext {
pub(crate) region_partition_physical_expr: Arc<dyn PhysicalExpr>,
@@ -514,7 +493,7 @@ impl RangeBase {
///
/// # Arguments
/// * `input` - The batch to filter
/// * `skip_fields` - Whether to skip field filters based on PreFilterMode and row group delete status
/// * `skip_fields` - Whether to skip field filters based on PreFilterMode
pub(crate) fn precise_filter(
&self,
mut input: Batch,
@@ -626,7 +605,7 @@ impl RangeBase {
///
/// # Arguments
/// * `input` - The RecordBatch to filter
/// * `skip_fields` - Whether to skip field filters based on PreFilterMode and row group delete status
/// * `skip_fields` - Whether to skip field filters based on PreFilterMode
pub(crate) fn precise_filter_flat(
&self,
input: RecordBatch,
@@ -679,7 +658,7 @@ impl RangeBase {
///
/// # Arguments
/// * `input` - The RecordBatch to compute mask for
/// * `skip_fields` - Whether to skip field filters based on PreFilterMode and row group delete status
/// * `skip_fields` - Whether to skip field filters based on PreFilterMode
pub(crate) fn compute_filter_mask_flat(
&self,
input: &RecordBatch,


@@ -72,7 +72,6 @@ use crate::sst::parquet::DEFAULT_READ_BATCH_SIZE;
use crate::sst::parquet::async_reader::SstAsyncFileReader;
use crate::sst::parquet::file_range::{
FileRangeContext, FileRangeContextRef, PartitionFilterContext, PreFilterMode, RangeBase,
row_group_contains_delete,
};
use crate::sst::parquet::format::{ReadFormat, need_override_sequence};
use crate::sst::parquet::metadata::MetadataLoader;
@@ -676,7 +675,7 @@ impl ParquetReaderBuilder {
metrics.rows_total += num_rows as usize;
// Compute skip_fields once for all pruning operations
let skip_fields = self.compute_skip_fields(parquet_meta);
let skip_fields = self.pre_filter_mode.skip_fields();
let mut output = self.row_groups_by_minmax(
read_format,
@@ -1112,25 +1111,6 @@ impl ParquetReaderBuilder {
pruned
}
/// Computes whether to skip field columns when building statistics based on PreFilterMode.
fn compute_skip_fields(&self, parquet_meta: &ParquetMetaData) -> bool {
match self.pre_filter_mode {
PreFilterMode::All => false,
PreFilterMode::SkipFields => true,
PreFilterMode::SkipFieldsOnDelete => {
// Check if any row group contains delete op
let file_path = self.file_handle.file_path(&self.table_dir, self.path_type);
(0..parquet_meta.num_row_groups()).any(|rg_idx| {
row_group_contains_delete(parquet_meta, rg_idx, &file_path)
.inspect_err(|e| {
warn!(e; "Failed to decode min value of op_type, fallback to not skipping fields");
})
.unwrap_or(false)
})
}
}
}
/// Computes row groups selection after min-max pruning.
fn row_groups_by_minmax(
&self,
@@ -1955,7 +1935,7 @@ impl ParquetReader {
return Ok(None);
};
let skip_fields = self.context.should_skip_fields(row_group_idx);
let skip_fields = self.context.pre_filter_mode().skip_fields();
let parquet_reader = self
.context
.reader_builder()
@@ -1988,7 +1968,7 @@ impl ParquetReader {
debug_assert!(context.read_format().as_flat().is_some());
let fetch_metrics = ParquetFetchMetrics::default();
let reader = if let Some((row_group_idx, row_selection)) = selection.pop_first() {
let skip_fields = context.should_skip_fields(row_group_idx);
let skip_fields = context.pre_filter_mode().skip_fields();
let parquet_reader = context
.reader_builder()
.build(context.build_context(


@@ -610,9 +610,16 @@ impl TestEnv {
let manifest_dir = data_home.join("manifest").as_path().display().to_string();
let builder = Fs::default();
let object_store = ObjectStore::new(builder.root(&manifest_dir))
.unwrap()
.finish();
let object_store = if let Some(mock_layer) = self.object_store_mock_layer.as_ref() {
ObjectStore::new(builder.root(&manifest_dir))
.unwrap()
.layer(mock_layer.clone())
.finish()
} else {
ObjectStore::new(builder.root(&manifest_dir))
.unwrap()
.finish()
};
// The "manifest_dir" here should be the relative path from the `object_store`'s root.
// Otherwise the OpenDal's list operation would fail with "StripPrefixError". This is


@@ -1196,6 +1196,9 @@ impl<S: LogStore> RegionWorkerLoop<S> {
BackgroundNotify::CompactionFinished(req) => {
self.handle_compaction_finished(region_id, req).await
}
BackgroundNotify::CompactionCancelled(req) => {
self.handle_compaction_cancelled(region_id, req).await
}
BackgroundNotify::CompactionFailed(req) => self.handle_compaction_failure(req).await,
BackgroundNotify::Truncate(req) => self.handle_truncate_result(req).await,
BackgroundNotify::RegionChange(req) => {


@@ -75,7 +75,7 @@ impl<S: LogStore> RegionWorkerLoop<S> {
return;
}
let staging_partition_info = region.staging_partition_info.lock().unwrap().clone();
let staging_partition_info = region.manifest_ctx.staging_partition_info();
let staging_partition_expr = staging_partition_info
.as_ref()


@@ -23,7 +23,8 @@ use crate::error::RegionNotFoundSnafu;
use crate::metrics::COMPACTION_REQUEST_COUNT;
use crate::region::MitoRegionRef;
use crate::request::{
BuildIndexRequest, CompactionFailed, CompactionFinished, OnFailure, OptionOutputTx,
BuildIndexRequest, CompactionCancelled, CompactionFailed, CompactionFinished, OnFailure,
OptionOutputTx,
};
use crate::sst::index::IndexBuildType;
use crate::worker::RegionWorkerLoop;
@@ -119,6 +120,28 @@ impl<S> RegionWorkerLoop<S> {
self.handle_ddl_requests(&mut pending_ddls).await;
}
pub(crate) async fn handle_compaction_cancelled(
&mut self,
region_id: RegionId,
request: CompactionCancelled,
) where
S: LogStore,
{
request.on_success();
// Reuse the scheduler's finish path to wake pending DDLs after a cooperative stop.
let mut pending_ddls = match self.regions.get_region(region_id) {
Some(_) => {
self.compaction_scheduler
.on_compaction_cancelled(region_id)
.await
}
None => Vec::new(),
};
self.handle_ddl_requests(&mut pending_ddls).await;
}
/// When compaction fails, we simply log the error.
pub(crate) async fn handle_compaction_failure(&mut self, req: CompactionFailed) {
error!(req.err; "Failed to compact region: {}", req.region_id);


@@ -19,6 +19,7 @@ use store_api::logstore::LogStore;
use store_api::region_request::{EnterStagingRequest, StagingPartitionDirective};
use store_api::storage::RegionId;
use crate::compaction::RequestCancelResult;
use crate::error::{RegionNotFoundSnafu, Result, StagingPartitionExprMismatchSnafu};
use crate::flush::FlushReason;
use crate::manifest::action::{RegionMetaAction, RegionMetaActionList, RegionPartitionExprChange};
@@ -42,7 +43,7 @@ impl<S: LogStore> RegionWorkerLoop<S> {
// If the region is already in staging mode, verify the partition directive matches.
if region.is_staging() {
let staging_partition_info = region.staging_partition_info.lock().unwrap().clone();
let staging_partition_info = region.manifest_ctx.staging_partition_info();
// If the partition directive mismatches, return error.
if staging_partition_info
.as_ref()
@@ -98,18 +99,24 @@ impl<S: LogStore> RegionWorkerLoop<S> {
return;
}
if self.compaction_scheduler.is_compacting(region_id) {
// Safety: region is compacting, add ddl request to pending queue.
self.compaction_scheduler
.add_ddl_request_to_pending(SenderDdlRequest {
region_id,
sender,
request: DdlRequest::EnterStaging(EnterStagingRequest {
partition_directive,
}),
});
match self.compaction_scheduler.request_cancel(region_id) {
RequestCancelResult::CancelIssued
| RequestCancelResult::AlreadyCancelling
| RequestCancelResult::TooLateToCancel => {
// Safety: the region is compacting or has entered the non-cancellable publish stage,
// so keep the DDL pending until the current task finishes or acknowledges cancellation.
self.compaction_scheduler
.add_ddl_request_to_pending(SenderDdlRequest {
region_id,
sender,
request: DdlRequest::EnterStaging(EnterStagingRequest {
partition_directive,
}),
});
return;
return;
}
RequestCancelResult::NotRunning => {}
}
self.handle_enter_staging(region, partition_directive, sender);
@@ -279,10 +286,8 @@ impl<S: LogStore> RegionWorkerLoop<S> {
region: &MitoRegionRef,
partition_directive: StagingPartitionDirective,
) {
let mut staging_partition_info = region.staging_partition_info.lock().unwrap();
debug_assert!(staging_partition_info.is_none());
*staging_partition_info = Some(StagingPartitionInfo::from_partition_directive(
partition_directive,
));
region.manifest_ctx.set_staging_partition_info(
StagingPartitionInfo::from_partition_directive(partition_directive),
);
}
}


@@ -17,6 +17,7 @@ use std::env;
use anyhow::Result;
use common_telemetry::info;
use common_test_util::temp_dir::create_temp_dir;
use futures::TryStreamExt;
use object_store::ObjectStore;
use object_store::services::{Fs, S3};
use object_store::test_util::TempFolder;
@@ -103,6 +104,109 @@ async fn test_object_list(store: &ObjectStore) -> Result<()> {
Ok(())
}
async fn test_object_list_start_after(store: &ObjectStore) -> Result<()> {
let scheme = store.info().scheme();
// `start_after` is a service-level capability. Skip the checks when the
// backend (e.g. the local Fs service) doesn't honor it natively — the
// bound would be silently ignored and the full listing returned.
if !store.info().native_capability().list_with_start_after {
info!("Skip test_object_list_start_after: backend {scheme} lacks start_after support");
return Ok(());
}
info!("Run test_object_list_start_after on backend {scheme}");
let files = [
"00000000000000000001.json",
"00000000000000000002.json",
"00000000000000000003.checkpoint",
"00000000000000000003.json",
"00000000000000000004.json",
];
for name in files {
store.write(name, "x").await?;
}
// Bare 20-digit bound: versions 1..=2 are skipped; version-3 deltas and
// checkpoint are kept (their `.` suffix sorts after the bound).
let lister = store
.lister_with("/")
.start_after("00000000000000000003")
.await?;
let mut got: Vec<String> = lister
.try_collect::<Vec<_>>()
.await?
.into_iter()
.filter(|e| e.metadata().mode() == EntryMode::FILE)
.map(|e| e.name().to_string())
.collect();
got.sort();
let mut expected = vec![
"00000000000000000003.checkpoint".to_string(),
"00000000000000000003.json".to_string(),
"00000000000000000004.json".to_string(),
];
expected.sort();
assert_eq!(expected, got);
// A bound that matches an existing name exactly excludes that name.
let lister = store
.lister_with("/")
.start_after("00000000000000000003.json")
.await?;
let got: Vec<String> = lister
.try_collect::<Vec<_>>()
.await?
.into_iter()
.filter(|e| e.metadata().mode() == EntryMode::FILE)
.map(|e| e.name().to_string())
.collect();
assert_eq!(vec!["00000000000000000004.json".to_string()], got);
for name in files {
store.delete(name).await?;
}
// OpenDAL resolves `start_after` against the operator root, not the
// `lister_with` path. For a nested prefix like `manifest/`, the bound
// must also embed that prefix — passing only the bare 20-digit name is
// silently a no-op because the full keys start with `m...` > `0...`.
let nested_files = [
"manifest/00000000000000000001.json",
"manifest/00000000000000000002.json",
"manifest/00000000000000000003.checkpoint",
"manifest/00000000000000000003.json",
"manifest/00000000000000000004.json",
];
for name in nested_files {
store.write(name, "x").await?;
}
let lister = store
.lister_with("manifest/")
.start_after("manifest/00000000000000000003")
.await?;
let mut got: Vec<String> = lister
.try_collect::<Vec<_>>()
.await?
.into_iter()
.filter(|e| e.metadata().mode() == EntryMode::FILE)
.map(|e| e.name().to_string())
.collect();
got.sort();
let mut expected = vec![
"00000000000000000003.checkpoint".to_string(),
"00000000000000000003.json".to_string(),
"00000000000000000004.json".to_string(),
];
expected.sort();
assert_eq!(expected, got);
for name in nested_files {
store.delete(name).await?;
}
Ok(())
}
fn assert_opendal_metrics() {
let metric_families = prometheus::gather();
let mut buffer = Vec::new();
@@ -129,6 +233,7 @@ async fn test_fs_backend() -> Result<()> {
test_object_crud(&store).await?;
test_object_list(&store).await?;
test_object_list_start_after(&store).await?;
assert_opendal_metrics();
@@ -158,6 +263,7 @@ async fn test_s3_backend() -> Result<()> {
let guard = TempFolder::new(&store, "/");
test_object_crud(&store).await?;
test_object_list(&store).await?;
test_object_list_start_after(&store).await?;
assert_opendal_metrics();
guard.remove_all().await?;
}
@@ -187,6 +293,7 @@ async fn test_oss_backend() -> Result<()> {
let guard = TempFolder::new(&store, "/");
test_object_crud(&store).await?;
test_object_list(&store).await?;
test_object_list_start_after(&store).await?;
assert_opendal_metrics();
guard.remove_all().await?;
}
@@ -216,6 +323,7 @@ async fn test_azblob_backend() -> Result<()> {
let guard = TempFolder::new(&store, "/");
test_object_crud(&store).await?;
test_object_list(&store).await?;
test_object_list_start_after(&store).await?;
assert_opendal_metrics();
guard.remove_all().await?;
}
@@ -244,6 +352,7 @@ async fn test_gcs_backend() -> Result<()> {
let guard = TempFolder::new(&store, "/");
test_object_crud(&store).await?;
test_object_list(&store).await?;
test_object_list_start_after(&store).await?;
assert_opendal_metrics();
guard.remove_all().await?;
}
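
The ordering rule that `test_object_list_start_after` relies on can be modeled without any storage backend. The sketch below is an illustration in plain `std` Rust, not OpenDAL's implementation: `list_start_after` is a hypothetical helper, and it assumes the bound is compared byte-wise against the full key (relative to the operator root) with a strict "greater than" check, as the test comments describe.

```rust
/// Hypothetical model of a `start_after` listing over a flat key space.
///
/// * `keys`   - full keys, relative to the operator root
/// * `prefix` - plays the role of the `lister_with` path
/// * `bound`  - compared against the FULL key, not the name under `prefix`
///
/// Returns the names under `prefix` (prefix stripped), sorted.
fn list_start_after(keys: &[&str], prefix: &str, bound: &str) -> Vec<String> {
    let mut names: Vec<String> = keys
        .iter()
        .copied()
        // Strictly greater than the bound, byte-wise lexicographic order.
        .filter(|k| *k > bound)
        // Keep only keys under the listed prefix and drop the prefix.
        .filter_map(|k| k.strip_prefix(prefix))
        .map(String::from)
        .collect();
    names.sort();
    names
}
```

Under this model, listing `manifest/` with the bound `manifest/00000000000000000003` keeps the version-3 checkpoint and delta plus version 4, while the bare bound `00000000000000000003` filters nothing, because every key beginning with `m` sorts after any key beginning with `0`.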
