Compare commits

..

32 Commits

Author SHA1 Message Date
Weny Xu
b25f24c6fe feat(meta-srv): add repartition procedure skeleton (#7487)
Signed-off-by: WenyXu <wenymedia@gmail.com>
2025-12-26 11:23:47 +00:00
Lei, HUANG
7bc0934eb3 refactor(mito2): make MemtableStats fields public (#7488)
Change visibility of estimated_bytes, time_range, max_sequence, and
series_count fields from private to public for external access.

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
2025-12-26 09:57:18 +00:00
Yingwen
89b9469250 feat: Implement per range stats for bulk memtable (#7486)
* feat: implement per range stats for MemtableRange

Signed-off-by: evenyag <realevenyag@gmail.com>

* refactor: extract methods to MemtableRanges

Signed-off-by: evenyag <realevenyag@gmail.com>

* fix: simple bulk memtable set other fields in stats

Signed-off-by: evenyag <realevenyag@gmail.com>

* refactor: use time_index_type()

Signed-off-by: evenyag <realevenyag@gmail.com>

* refactor: use time index type

Signed-off-by: evenyag <realevenyag@gmail.com>

---------

Signed-off-by: evenyag <realevenyag@gmail.com>
2025-12-26 07:24:11 +00:00
Weny Xu
518a4e013b refactor(mito2): reorganize manifest storage into modular components (#7483)
* refactor(mito2): reorganize manifest storage into modular components

Signed-off-by: WenyXu <wenymedia@gmail.com>

* chore: apply suggestions

Signed-off-by: WenyXu <wenymedia@gmail.com>

* chore: apply suggestions from CR

Signed-off-by: WenyXu <wenymedia@gmail.com>

* chore: sort

Signed-off-by: WenyXu <wenymedia@gmail.com>

* chore: fmt

Signed-off-by: WenyXu <wenymedia@gmail.com>

---------

Signed-off-by: WenyXu <wenymedia@gmail.com>
2025-12-26 02:24:27 +00:00
Lei, HUANG
fffad499ca chore: mount cargo git cache in docker builds (#7484)
Mount the cargo git cache directory (${HOME}/.cargo/git) in docker build
containers to improve rebuild performance by caching git dependencies.

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
2025-12-26 01:56:11 +00:00
yihong
0c9f58316d fix: more wait time for sqlness start and better message (#7485)
Signed-off-by: yihong0618 <zouzou0208@gmail.com>
2025-12-26 01:55:20 +00:00
ZonaHe
4f290111db feat: update dashboard to v0.11.11 (#7481)
Co-authored-by: sunchanglong <sunchanglong@users.noreply.github.com>
2025-12-25 18:43:14 +00:00
Weny Xu
294f19fa1d feat(metric-engine): support sync logical regions from source region (#7438)
* chore: move file

Signed-off-by: WenyXu <wenymedia@gmail.com>

* feat(metric-engine): support sync logical regions from source region

Signed-off-by: WenyXu <wenymedia@gmail.com>

* fix: fix unit tests

Signed-off-by: WenyXu <wenymedia@gmail.com>

* chore: apply suggestions

Signed-off-by: WenyXu <wenymedia@gmail.com>

* chore: add comments

Signed-off-by: WenyXu <wenymedia@gmail.com>

* chore: add comments

Signed-off-by: WenyXu <wenymedia@gmail.com>

* chore: apply suggestions from CR

Signed-off-by: WenyXu <wenymedia@gmail.com>

---------

Signed-off-by: WenyXu <wenymedia@gmail.com>
2025-12-25 09:06:58 +00:00
ZonaHe
be530ac1de feat: update dashboard to v0.11.10 (#7479)
Co-authored-by: sunchanglong <sunchanglong@users.noreply.github.com>
2025-12-25 04:27:10 +00:00
jeremyhi
434b4d8183 feat: refine the MemoryGuard (#7466)
* feat: refine MemoryGuard

Signed-off-by: jeremyhi <fengjiachun@gmail.com>

* chore: add test

Signed-off-by: jeremyhi <fengjiachun@gmail.com>

---------

Signed-off-by: jeremyhi <fengjiachun@gmail.com>
2025-12-25 04:09:32 +00:00
Lei, HUANG
3ad0b60c4b chore(metric-engine): set default compaction time window for data region (#7474)
chore: set compaction time window for metric engine data region to 1 day by default

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
2025-12-25 03:55:17 +00:00
Ning Sun
19ae845225 refactor: cache server memory limiter for other components (#7470) 2025-12-25 03:46:50 +00:00
dennis zhuang
3866512cf6 feat: add more MySQL-compatible string functions (#7454)
* feat: add more mysql string functions

Signed-off-by: Dennis Zhuang <killme2008@gmail.com>

* refactor: use datafusion aliasing mechanism, close #7415

Signed-off-by: Dennis Zhuang <killme2008@gmail.com>

* chore: comment

Signed-off-by: Dennis Zhuang <killme2008@gmail.com>

* fix: comment and style

Signed-off-by: Dennis Zhuang <killme2008@gmail.com>

---------

Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
2025-12-25 03:28:57 +00:00
LFC
d4870ee2af fix: typo in AI-assisted contributions policy (#7472)
* Fix typo in AI-assisted contributions policy

* Update project name from DataFusion to GreptimeDB
2025-12-25 03:03:14 +00:00
discord9
aea4e9fa55 fix: RemovedFiles deser compatibility (#7475)
* fix: compat for RemovedFiles

Signed-off-by: discord9 <discord9@163.com>

* cr

Signed-off-by: discord9 <discord9@163.com>

---------

Signed-off-by: discord9 <discord9@163.com>
2025-12-25 02:50:34 +00:00
AntiTopQuark
cea578244c fix(compaction): unify behavior of database compaction options with TTL (#7402)
* fix: fix dynamic compaction option, unify behavior of database compaction options with TTL option

Signed-off-by: AntiTopQuark <AntiTopQuark1350@outlook.com>

* fix unit test

Signed-off-by: AntiTopQuark <AntiTopQuark1350@outlook.com>

* add debug log

Signed-off-by: AntiTopQuark <AntiTopQuark1350@outlook.com>

---------

Signed-off-by: AntiTopQuark <AntiTopQuark1350@outlook.com>
2025-12-25 02:34:42 +00:00
Weny Xu
e1b18614ee feat(mito2): implement ApplyStagingManifest request handling (#7456)
* feat(mito2): implement `ApplyStagingManifest` request handling

Signed-off-by: WenyXu <wenymedia@gmail.com>

* chore: apply suggestions from CR

Signed-off-by: WenyXu <wenymedia@gmail.com>

* chore: fmt

Signed-off-by: WenyXu <wenymedia@gmail.com>

* chore: apply suggestions from CR

Signed-off-by: WenyXu <wenymedia@gmail.com>

* chore: apply suggestions from CR

Signed-off-by: WenyXu <wenymedia@gmail.com>

* fix: fix logic

Signed-off-by: WenyXu <wenymedia@gmail.com>

* chore: update proto

Signed-off-by: WenyXu <wenymedia@gmail.com>

---------

Signed-off-by: WenyXu <wenymedia@gmail.com>
2025-12-24 09:05:09 +00:00
Frost Ming
4bae75ccdb docs: refer to the correct project name in AI guidelines (#7471)
doc: refer to the correct project name in AI guidelines
2025-12-24 07:58:36 +00:00
LFC
dc9f3a702e refactor: explicitly define json struct to ingest jsonbench data (#7462)
ingest jsonbench data

Signed-off-by: luofucong <luofc@foxmail.com>
2025-12-24 07:30:22 +00:00
Weny Xu
2d9967b981 fix(mito2): pass partition expr explicitly to flush task for region (#7461)
* fix(mito2): pass partition expr explicitly to flush task for staging mode

Signed-off-by: WenyXu <wenymedia@gmail.com>

* chore: apply suggestions from CR

Signed-off-by: WenyXu <wenymedia@gmail.com>

* chore: rename

Signed-off-by: WenyXu <wenymedia@gmail.com>

---------

Signed-off-by: WenyXu <wenymedia@gmail.com>
2025-12-24 04:18:06 +00:00
discord9
dec0d522f8 feat: gc versioned index (#7412)
* feat: add index version to file ref

Signed-off-by: discord9 <discord9@163.com>

* refactor wip

Signed-off-by: discord9 <discord9@163.com>

* wip

Signed-off-by: discord9 <discord9@163.com>

* update gc worker

Signed-off-by: discord9 <discord9@163.com>

* stuff

Signed-off-by: discord9 <discord9@163.com>

* gc report for index files

Signed-off-by: discord9 <discord9@163.com>

* fix: type

Signed-off-by: discord9 <discord9@163.com>

* stuff

Signed-off-by: discord9 <discord9@163.com>

* chore: clippy

Signed-off-by: discord9 <discord9@163.com>

* chore: metrics

Signed-off-by: discord9 <discord9@163.com>

* typo

Signed-off-by: discord9 <discord9@163.com>

* typo

Signed-off-by: discord9 <discord9@163.com>

* chore: naming

Signed-off-by: discord9 <discord9@163.com>

* docs: update explain

Signed-off-by: discord9 <discord9@163.com>

* test: parse file id/type from file path

Signed-off-by: discord9 <discord9@163.com>

* chore: change parse method visibility to crate

Signed-off-by: discord9 <discord9@163.com>

* pcr

Signed-off-by: discord9 <discord9@163.com>

* pcr

Signed-off-by: discord9 <discord9@163.com>

* chore

Signed-off-by: discord9 <discord9@163.com>

---------

Signed-off-by: discord9 <discord9@163.com>
2025-12-24 03:07:53 +00:00
dennis zhuang
17e2b98132 docs: rfc for vector index (#7353)
* docs: rfc for vector index

Signed-off-by: Dennis Zhuang <killme2008@gmail.com>

* chore: explain why choose USearch and distributed query

Signed-off-by: Dennis Zhuang <killme2008@gmail.com>

* fix: row id mapping

Signed-off-by: Dennis Zhuang <killme2008@gmail.com>

* refine proposal

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

* rename rfc file

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

---------

Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
2025-12-24 02:54:25 +00:00
Weny Xu
ee86987912 feat(repartition): implement enter staging region state (#7447)
* feat(repartition): implement enter staging region state

Signed-off-by: WenyXu <wenymedia@gmail.com>

* chore: apply suggestions from CR

Signed-off-by: WenyXu <wenymedia@gmail.com>

---------

Signed-off-by: WenyXu <wenymedia@gmail.com>
2025-12-24 02:50:27 +00:00
Ruihang Xia
0cea58c642 docs: about AI-assisted contributions (#7464)
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
2025-12-23 14:20:21 +00:00
discord9
fdedbb8261 fix: part sort share same topk dyn filter&early stop use dyn filter (#7460)
* fix: part sort share same topk dyn filter

Signed-off-by: discord9 <discord9@163.com>

* test: one

Signed-off-by: discord9 <discord9@163.com>

* feat: use dyn filter properly instead

Signed-off-by: discord9 <discord9@163.com>

* c

Signed-off-by: discord9 <discord9@163.com>

* docs: explain why dyn filter work

Signed-off-by: discord9 <discord9@163.com>

* chore: after rebase fix

Signed-off-by: discord9 <discord9@163.com>

---------

Signed-off-by: discord9 <discord9@163.com>
2025-12-23 09:24:55 +00:00
Lanqing Yang
8d9afc83e3 feat: allow auto schema creation for pg (#7459)
Signed-off-by: lyang24 <lanqingy93@gmail.com>
2025-12-23 08:55:24 +00:00
LFC
625fdd09ea refactor!: remove not working metasrv cli option (#7446)
Signed-off-by: luofucong <luofc@foxmail.com>
2025-12-23 06:55:17 +00:00
discord9
b3bc3c76f1 feat: file range dynamic filter (#7441)
* feat: add dynamic filtering support in file range and predicate handling

Signed-off-by: discord9 <discord9@163.com>

* clippy

Signed-off-by: discord9 <discord9@163.com>

* c

Signed-off-by: discord9 <discord9@163.com>

* c

Signed-off-by: discord9 <discord9@163.com>

* per review

Signed-off-by: discord9 <discord9@163.com>

* per review

Signed-off-by: discord9 <discord9@163.com>

* pcr

Signed-off-by: discord9 <discord9@163.com>

* c

Signed-off-by: discord9 <discord9@163.com>

---------

Signed-off-by: discord9 <discord9@163.com>
2025-12-23 06:15:30 +00:00
yihong
342eb47e19 fix: close issue #7457 guard against empty buffer (#7458)
* fix: close issue #7457 guard against empty buffer

Signed-off-by: yihong0618 <zouzou0208@gmail.com>

* fix: add unittests for it

Signed-off-by: yihong0618 <zouzou0208@gmail.com>

---------

Signed-off-by: yihong0618 <zouzou0208@gmail.com>
2025-12-23 03:11:00 +00:00
jeremyhi
6a6b34c709 feat!: memory limiter unification write path (#7437)
* feat: remove option max_in_flight_write_bytes

Signed-off-by: jeremyhi <fengjiachun@gmail.com>

* feat: replace RequestMemoryLimiter

Signed-off-by: jeremyhi <fengjiachun@gmail.com>

* chore: add integration test

Signed-off-by: jeremyhi <fengjiachun@gmail.com>

* chore: fix test

Signed-off-by: jeremyhi <fengjiachun@gmail.com>

* fix: by AI comment

Signed-off-by: jeremyhi <fengjiachun@gmail.com>

* refactor: global permit pool on writing

Signed-off-by: jeremyhi <fengjiachun@gmail.com>

* chore: by ai comment

Signed-off-by: jeremyhi <fengjiachun@gmail.com>

---------

Signed-off-by: jeremyhi <fengjiachun@gmail.com>
2025-12-23 02:18:49 +00:00
Lei, HUANG
a8b512dded chore: expose symbols (#7451)
* chore/expose-symbols:
 ### Commit Message

 Enhance `merge_and_dedup` Functionality in `flush.rs`

 - **Function Signature Update**: Modified the `merge_and_dedup` function to accept `append_mode` and `merge_mode` as separate parameters instead of using `options`.
 - **Function Accessibility**: Changed the visibility of `merge_and_dedup` to `pub` to allow external access.
 - **Function Calls Update**: Updated calls to `merge_and_dedup` within `memtable_flat_sources` to align with the new function signature, passing `options.append_mode` and `options.merge_mode()` directly.

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* chore/expose-symbols:
 ### Add Merge and Deduplication Functionality

 - **File**: `src/mito2/src/flush.rs`
   - Introduced `merge_and_dedup` function to merge multiple record batch iterators and apply deduplication based on specified modes.
   - Added detailed documentation for the function, explaining its arguments, behavior, and usage examples.

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

---------

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
2025-12-22 05:39:03 +00:00
Ning Sun
bd8ffd3db9 feat: pgwire 0.37 (#7443) 2025-12-22 05:13:39 +00:00
194 changed files with 12043 additions and 3932 deletions

3
.gitignore vendored
View File

@@ -67,3 +67,6 @@ greptimedb_data
# Claude code
CLAUDE.md
# AGENTS.md
AGENTS.md

View File

@@ -102,6 +102,30 @@ like `feat`/`fix`/`docs`, with a concise summary of code change following. AVOID
All commit messages SHOULD adhere to the [Conventional Commits specification](https://conventionalcommits.org/).
## AI-Assisted contributions
We have the following policy for AI-assisted PRs:
- The PR author should **understand the core ideas** behind the implementation **end-to-end**, and be able to justify the design and code during review.
- **Call out unknowns and assumptions.** It's okay not to fully understand some bits of AI-generated code, but you should comment on these cases and point them out to reviewers so that they can use their knowledge of the codebase to clear up any concerns. For example, you might comment: "Calling this function here seems to work, but I'm not familiar with how it works internally; I wonder if there's a race condition if it is called concurrently."
### Why fully AI-generated PRs without understanding are not helpful
Today, AI tools cannot reliably make complex changes to GreptimeDB on their own, which is why we rely on pull requests and code review.
The purposes of code review are:
1. Finish the intended task.
2. Share knowledge between authors and reviewers, as a long-term investment in the project. For this reason, even if someone familiar with the codebase can finish a task quickly, we're still happy to help a new contributor work on it even if it takes longer.
An AI dump for an issue doesn't meet these purposes. Maintainers could finish the task faster by using AI directly, and submitters gain little knowledge if they act only as a pass-through AI proxy without understanding.
Please understand that reviewing capacity for the project is **very limited**, so large PRs that appear to lack the requisite understanding might not get reviewed and may eventually be closed or redirected.
### Better ways to contribute than an “AI dump”
It's recommended to write a high-quality issue with a clear problem statement and a minimal, reproducible example. This can make it easier for others to contribute.
## Getting Help
There are many ways to get help when you're stuck. It is recommended to ask for help by opening an issue, with a detailed description

48
Cargo.lock generated
View File

@@ -2580,10 +2580,12 @@ dependencies = [
name = "common-sql"
version = "1.0.0-beta.3"
dependencies = [
"arrow-schema",
"common-base",
"common-decimal",
"common-error",
"common-macro",
"common-telemetry",
"common-time",
"datafusion-sql",
"datatypes",
@@ -5036,6 +5038,7 @@ dependencies = [
"common-function",
"common-grpc",
"common-macro",
"common-memory-manager",
"common-meta",
"common-options",
"common-procedure",
@@ -5461,7 +5464,7 @@ dependencies = [
[[package]]
name = "greptime-proto"
version = "0.1.0"
source = "git+https://github.com/GreptimeTeam/greptime-proto.git?rev=173efe5ec62722089db7c531c0b0d470a072b915#173efe5ec62722089db7c531c0b0d470a072b915"
source = "git+https://github.com/GreptimeTeam/greptime-proto.git?rev=520fa524f9d590752ea327683e82ffd65721b27c#520fa524f9d590752ea327683e82ffd65721b27c"
dependencies = [
"prost 0.13.5",
"prost-types 0.13.5",
@@ -7619,6 +7622,7 @@ dependencies = [
"async-trait",
"base64 0.22.1",
"bytes",
"chrono",
"common-base",
"common-error",
"common-macro",
@@ -9320,9 +9324,9 @@ dependencies = [
[[package]]
name = "pgwire"
version = "0.36.3"
version = "0.37.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "70a2bcdcc4b20a88e0648778ecf00415bbd5b447742275439c22176835056f99"
checksum = "02d86d57e732d40382ceb9bfea80901d839bae8571aa11c06af9177aed9dfb6c"
dependencies = [
"async-trait",
"base64 0.22.1",
@@ -9341,6 +9345,7 @@ dependencies = [
"ryu",
"serde",
"serde_json",
"smol_str",
"stringprep",
"thiserror 2.0.17",
"tokio",
@@ -11505,10 +11510,11 @@ checksum = "1bc711410fbe7399f390ca1c3b60ad0f53f80e95c5eb935e52268a0e2cd49acc"
[[package]]
name = "serde"
version = "1.0.219"
version = "1.0.228"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5f0e2c6ed6606019b4e29e69dbaba95b11854410e5347d525002456dbbb786b6"
checksum = "9a8e94ea7f378bd32cbbd37198a4a91436180c5bb472411e48b5ec2e2124ae9e"
dependencies = [
"serde_core",
"serde_derive",
]
@@ -11523,10 +11529,19 @@ dependencies = [
]
[[package]]
name = "serde_derive"
version = "1.0.219"
name = "serde_core"
version = "1.0.228"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5b0276cf7f2c73365f7157c8123c21cd9a50fbbd844757af28ca1f5925fc2a00"
checksum = "41d385c7d4ca58e59fc732af25c3983b67ac852c1a25000afe1175de458b67ad"
dependencies = [
"serde_derive",
]
[[package]]
name = "serde_derive"
version = "1.0.228"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d540f220d3187173da220f885ab66608367b6574e925011a9353e4badda91d79"
dependencies = [
"proc-macro2",
"quote",
@@ -11679,6 +11694,7 @@ dependencies = [
"common-grpc",
"common-macro",
"common-mem-prof",
"common-memory-manager",
"common-meta",
"common-plugins",
"common-pprof",
@@ -12001,6 +12017,16 @@ dependencies = [
"serde",
]
[[package]]
name = "smol_str"
version = "0.3.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "3498b0a27f93ef1402f20eefacfaa1691272ac4eca1cdc8c596cb0a245d6cbf5"
dependencies = [
"borsh",
"serde_core",
]
[[package]]
name = "snafu"
version = "0.7.5"
@@ -12206,7 +12232,7 @@ dependencies = [
[[package]]
name = "sqlparser"
version = "0.58.0"
source = "git+https://github.com/GreptimeTeam/sqlparser-rs.git?rev=4b519a5caa95472cc3988f5556813a583dd35af1#4b519a5caa95472cc3988f5556813a583dd35af1"
source = "git+https://github.com/GreptimeTeam/sqlparser-rs.git?rev=a0ce2bc6eb3e804532932f39833c32432f5c9a39#a0ce2bc6eb3e804532932f39833c32432f5c9a39"
dependencies = [
"lazy_static",
"log",
@@ -12230,7 +12256,7 @@ dependencies = [
[[package]]
name = "sqlparser_derive"
version = "0.3.0"
source = "git+https://github.com/GreptimeTeam/sqlparser-rs.git?rev=4b519a5caa95472cc3988f5556813a583dd35af1#4b519a5caa95472cc3988f5556813a583dd35af1"
source = "git+https://github.com/GreptimeTeam/sqlparser-rs.git?rev=a0ce2bc6eb3e804532932f39833c32432f5c9a39#a0ce2bc6eb3e804532932f39833c32432f5c9a39"
dependencies = [
"proc-macro2",
"quote",
@@ -12461,6 +12487,7 @@ dependencies = [
"common-config",
"common-error",
"common-macro",
"common-memory-manager",
"common-meta",
"common-options",
"common-procedure",
@@ -13163,6 +13190,7 @@ dependencies = [
"common-event-recorder",
"common-frontend",
"common-grpc",
"common-memory-manager",
"common-meta",
"common-procedure",
"common-query",

View File

@@ -150,7 +150,7 @@ etcd-client = { version = "0.16.1", features = [
fst = "0.4.7"
futures = "0.3"
futures-util = "0.3"
greptime-proto = { git = "https://github.com/GreptimeTeam/greptime-proto.git", rev = "173efe5ec62722089db7c531c0b0d470a072b915" }
greptime-proto = { git = "https://github.com/GreptimeTeam/greptime-proto.git", rev = "520fa524f9d590752ea327683e82ffd65721b27c" }
hex = "0.4"
http = "1"
humantime = "2.1"
@@ -332,7 +332,7 @@ datafusion-physical-plan = { git = "https://github.com/GreptimeTeam/datafusion.g
datafusion-datasource = { git = "https://github.com/GreptimeTeam/datafusion.git", rev = "fd4b2abcf3c3e43e94951bda452c9fd35243aab0" }
datafusion-sql = { git = "https://github.com/GreptimeTeam/datafusion.git", rev = "fd4b2abcf3c3e43e94951bda452c9fd35243aab0" }
datafusion-substrait = { git = "https://github.com/GreptimeTeam/datafusion.git", rev = "fd4b2abcf3c3e43e94951bda452c9fd35243aab0" }
sqlparser = { git = "https://github.com/GreptimeTeam/sqlparser-rs.git", rev = "4b519a5caa95472cc3988f5556813a583dd35af1" } # branch = "v0.58.x"
sqlparser = { git = "https://github.com/GreptimeTeam/sqlparser-rs.git", rev = "a0ce2bc6eb3e804532932f39833c32432f5c9a39" } # branch = "v0.58.x"
[profile.release]
debug = 1

View File

@@ -14,6 +14,7 @@ BUILDX_BUILDER_NAME ?= gtbuilder
BASE_IMAGE ?= ubuntu
RUST_TOOLCHAIN ?= $(shell cat rust-toolchain.toml | grep channel | cut -d'"' -f2)
CARGO_REGISTRY_CACHE ?= ${HOME}/.cargo/registry
CARGO_GIT_CACHE ?= ${HOME}/.cargo/git
ARCH := $(shell uname -m | sed 's/x86_64/amd64/' | sed 's/aarch64/arm64/')
OUTPUT_DIR := $(shell if [ "$(RELEASE)" = "true" ]; then echo "release"; elif [ ! -z "$(CARGO_PROFILE)" ]; then echo "$(CARGO_PROFILE)" ; else echo "debug"; fi)
SQLNESS_OPTS ?=
@@ -86,7 +87,7 @@ build: ## Build debug version greptime.
build-by-dev-builder: ## Build greptime by dev-builder.
docker run --network=host \
${ASSEMBLED_EXTRA_BUILD_ENV} \
-v ${PWD}:/greptimedb -v ${CARGO_REGISTRY_CACHE}:/root/.cargo/registry \
-v ${PWD}:/greptimedb -v ${CARGO_REGISTRY_CACHE}:/root/.cargo/registry -v ${CARGO_GIT_CACHE}:/root/.cargo/git \
-w /greptimedb ${IMAGE_REGISTRY}/${IMAGE_NAMESPACE}/dev-builder-${BASE_IMAGE}:${DEV_BUILDER_IMAGE_TAG} \
make build \
CARGO_EXTENSION="${CARGO_EXTENSION}" \
@@ -100,7 +101,7 @@ build-by-dev-builder: ## Build greptime by dev-builder.
.PHONY: build-android-bin
build-android-bin: ## Build greptime binary for android.
docker run --network=host \
-v ${PWD}:/greptimedb -v ${CARGO_REGISTRY_CACHE}:/root/.cargo/registry \
-v ${PWD}:/greptimedb -v ${CARGO_REGISTRY_CACHE}:/root/.cargo/registry -v ${CARGO_GIT_CACHE}:/root/.cargo/git \
-w /greptimedb ${IMAGE_REGISTRY}/${IMAGE_NAMESPACE}/dev-builder-android:${DEV_BUILDER_IMAGE_TAG} \
make build \
CARGO_EXTENSION="ndk --platform 23 -t aarch64-linux-android" \
@@ -206,7 +207,7 @@ fix-udeps: ## Remove unused dependencies automatically.
@cargo udeps --workspace --all-targets --output json > udeps-report.json || true
@echo "Removing unused dependencies..."
@python3 scripts/fix-udeps.py udeps-report.json
.PHONY: fmt-check
fmt-check: ## Check code format.
cargo fmt --all -- --check
@@ -224,7 +225,7 @@ stop-etcd: ## Stop single node etcd for testing purpose.
.PHONY: run-it-in-container
run-it-in-container: start-etcd ## Run integration tests in dev-builder.
docker run --network=host \
-v ${PWD}:/greptimedb -v ${CARGO_REGISTRY_CACHE}:/root/.cargo/registry -v /tmp:/tmp \
-v ${PWD}:/greptimedb -v ${CARGO_REGISTRY_CACHE}:/root/.cargo/registry -v ${CARGO_GIT_CACHE}:/root/.cargo/git -v /tmp:/tmp \
-w /greptimedb ${IMAGE_REGISTRY}/${IMAGE_NAMESPACE}/dev-builder-${BASE_IMAGE}:${DEV_BUILDER_IMAGE_TAG} \
make test sqlness-test BUILD_JOBS=${BUILD_JOBS}

View File

@@ -14,11 +14,12 @@
| --- | -----| ------- | ----------- |
| `default_timezone` | String | Unset | The default timezone of the server. |
| `default_column_prefix` | String | Unset | The default column prefix for auto-created time index and value columns. |
| `max_in_flight_write_bytes` | String | Unset | Maximum total memory for all concurrent write request bodies and messages (HTTP, gRPC, Flight).<br/>Set to 0 to disable the limit. Default: "0" (unlimited) |
| `write_bytes_exhausted_policy` | String | Unset | Policy when write bytes quota is exhausted.<br/>Options: "wait" (default, 10s timeout), "wait(<duration>)" (e.g., "wait(30s)"), "fail" |
| `init_regions_in_background` | Bool | `false` | Initialize all regions in the background during the startup.<br/>By default, it provides services after all regions have been initialized. |
| `init_regions_parallelism` | Integer | `16` | Parallelism of initializing regions. |
| `max_concurrent_queries` | Integer | `0` | The maximum number of concurrent queries allowed to execute. Zero means unlimited.<br/>NOTE: This setting affects scan_memory_limit's privileged tier allocation.<br/>When set, 70% of queries get privileged memory access (full scan_memory_limit).<br/>The remaining 30% get standard tier access (70% of scan_memory_limit). |
| `enable_telemetry` | Bool | `true` | Enable telemetry to collect anonymous usage data. Enabled by default. |
| `max_in_flight_write_bytes` | String | Unset | The maximum in-flight write bytes. |
| `runtime` | -- | -- | The runtime options. |
| `runtime.global_rt_size` | Integer | `8` | The number of threads to execute the runtime for global read operations. |
| `runtime.compact_rt_size` | Integer | `4` | The number of threads to execute the runtime for global write operations. |
@@ -26,14 +27,12 @@
| `http.addr` | String | `127.0.0.1:4000` | The address to bind the HTTP server. |
| `http.timeout` | String | `0s` | HTTP request timeout. Set to 0 to disable timeout. |
| `http.body_limit` | String | `64MB` | HTTP request body limit.<br/>The following units are supported: `B`, `KB`, `KiB`, `MB`, `MiB`, `GB`, `GiB`, `TB`, `TiB`, `PB`, `PiB`.<br/>Set to 0 to disable limit. |
| `http.max_total_body_memory` | String | Unset | Maximum total memory for all concurrent HTTP request bodies.<br/>Set to 0 to disable the limit. Default: "0" (unlimited) |
| `http.enable_cors` | Bool | `true` | HTTP CORS support, it's turned on by default<br/>This allows browser to access http APIs without CORS restrictions |
| `http.cors_allowed_origins` | Array | Unset | Customize allowed origins for HTTP CORS. |
| `http.prom_validation_mode` | String | `strict` | Whether to enable validation for Prometheus remote write requests.<br/>Available options:<br/>- strict: deny invalid UTF-8 strings (default).<br/>- lossy: allow invalid UTF-8 strings, replace invalid characters with REPLACEMENT_CHARACTER(U+FFFD).<br/>- unchecked: do not validate strings. |
| `grpc` | -- | -- | The gRPC server options. |
| `grpc.bind_addr` | String | `127.0.0.1:4001` | The address to bind the gRPC server. |
| `grpc.runtime_size` | Integer | `8` | The number of server worker threads. |
| `grpc.max_total_message_memory` | String | Unset | Maximum total memory for all concurrent gRPC request messages.<br/>Set to 0 to disable the limit. Default: "0" (unlimited) |
| `grpc.max_connection_age` | String | Unset | The maximum connection age for gRPC connection.<br/>The value can be a human-readable time string. For example: `10m` for ten minutes or `1h` for one hour.<br/>Refer to https://grpc.io/docs/guides/keepalive/ for more details. |
| `grpc.tls` | -- | -- | gRPC server TLS options, see `mysql.tls` section. |
| `grpc.tls.mode` | String | `disable` | TLS mode. |
@@ -227,7 +226,8 @@
| --- | -----| ------- | ----------- |
| `default_timezone` | String | Unset | The default timezone of the server. |
| `default_column_prefix` | String | Unset | The default column prefix for auto-created time index and value columns. |
| `max_in_flight_write_bytes` | String | Unset | The maximum in-flight write bytes. |
| `max_in_flight_write_bytes` | String | Unset | Maximum total memory for all concurrent write request bodies and messages (HTTP, gRPC, Flight).<br/>Set to 0 to disable the limit. Default: "0" (unlimited) |
| `write_bytes_exhausted_policy` | String | Unset | Policy when write bytes quota is exhausted.<br/>Options: "wait" (default, 10s timeout), "wait(<duration>)" (e.g., "wait(30s)"), "fail" |
| `runtime` | -- | -- | The runtime options. |
| `runtime.global_rt_size` | Integer | `8` | The number of threads to execute the runtime for global read operations. |
| `runtime.compact_rt_size` | Integer | `4` | The number of threads to execute the runtime for global write operations. |
@@ -238,7 +238,6 @@
| `http.addr` | String | `127.0.0.1:4000` | The address to bind the HTTP server. |
| `http.timeout` | String | `0s` | HTTP request timeout. Set to 0 to disable timeout. |
| `http.body_limit` | String | `64MB` | HTTP request body limit.<br/>The following units are supported: `B`, `KB`, `KiB`, `MB`, `MiB`, `GB`, `GiB`, `TB`, `TiB`, `PB`, `PiB`.<br/>Set to 0 to disable limit. |
| `http.max_total_body_memory` | String | Unset | Maximum total memory for all concurrent HTTP request bodies.<br/>Set to 0 to disable the limit. Default: "0" (unlimited) |
| `http.enable_cors` | Bool | `true` | HTTP CORS support, it's turned on by default<br/>This allows browser to access http APIs without CORS restrictions |
| `http.cors_allowed_origins` | Array | Unset | Customize allowed origins for HTTP CORS. |
| `http.prom_validation_mode` | String | `strict` | Whether to enable validation for Prometheus remote write requests.<br/>Available options:<br/>- strict: deny invalid UTF-8 strings (default).<br/>- lossy: allow invalid UTF-8 strings, replace invalid characters with REPLACEMENT_CHARACTER(U+FFFD).<br/>- unchecked: do not validate strings. |
@@ -246,7 +245,6 @@
| `grpc.bind_addr` | String | `127.0.0.1:4001` | The address to bind the gRPC server. |
| `grpc.server_addr` | String | `127.0.0.1:4001` | The address advertised to the metasrv, and used for connections from outside the host.<br/>If left empty or unset, the server will automatically use the IP address of the first network interface<br/>on the host, with the same port number as the one specified in `grpc.bind_addr`. |
| `grpc.runtime_size` | Integer | `8` | The number of server worker threads. |
| `grpc.max_total_message_memory` | String | Unset | Maximum total memory for all concurrent gRPC request messages.<br/>Set to 0 to disable the limit. Default: "0" (unlimited) |
| `grpc.flight_compression` | String | `arrow_ipc` | Compression mode for frontend side Arrow IPC service. Available options:<br/>- `none`: disable all compression<br/>- `transport`: only enable gRPC transport compression (zstd)<br/>- `arrow_ipc`: only enable Arrow IPC compression (lz4)<br/>- `all`: enable all compression.<br/>Default to `none` |
| `grpc.max_connection_age` | String | Unset | The maximum connection age for gRPC connection.<br/>The value can be a human-readable time string. For example: `10m` for ten minutes or `1h` for one hour.<br/>Refer to https://grpc.io/docs/guides/keepalive/ for more details. |
| `grpc.tls` | -- | -- | gRPC server TLS options, see `mysql.tls` section. |
@@ -346,10 +344,10 @@
| `store_key_prefix` | String | `""` | If it's not empty, the metasrv will store all data with this key prefix. |
| `backend` | String | `etcd_store` | The datastore for meta server.<br/>Available values:<br/>- `etcd_store` (default value)<br/>- `memory_store`<br/>- `postgres_store`<br/>- `mysql_store` |
| `meta_table_name` | String | `greptime_metakv` | Table name in RDS to store metadata. Effect when using a RDS kvbackend.<br/>**Only used when backend is `postgres_store`.** |
| `meta_schema_name` | String | `greptime_schema` | Optional PostgreSQL schema for metadata table and election table name qualification.<br/>When PostgreSQL public schema is not writable (e.g., PostgreSQL 15+ with restricted public),<br/>set this to a writable schema. GreptimeDB will use `meta_schema_name`.`meta_table_name`.<br/>GreptimeDB will NOT create the schema automatically; please ensure it exists or the user has permission.<br/>**Only used when backend is `postgres_store`.** |
| `meta_schema_name` | String | `greptime_schema` | Optional PostgreSQL schema for metadata table and election table name qualification.<br/>When PostgreSQL public schema is not writable (e.g., PostgreSQL 15+ with restricted public),<br/>set this to a writable schema. GreptimeDB will use `meta_schema_name`.`meta_table_name`.<br/>**Only used when backend is `postgres_store`.** |
| `auto_create_schema` | Bool | `true` | Automatically create PostgreSQL schema if it doesn't exist.<br/>When enabled, the system will execute `CREATE SCHEMA IF NOT EXISTS <schema_name>`<br/>before creating metadata tables. This is useful in production environments where<br/>manual schema creation may be restricted.<br/>Default is true.<br/>Note: The PostgreSQL user must have CREATE SCHEMA permission for this to work.<br/>**Only used when backend is `postgres_store`.** |
| `meta_election_lock_id` | Integer | `1` | Advisory lock id in PostgreSQL for election. Effect when using PostgreSQL as kvbackend<br/>Only used when backend is `postgres_store`. |
| `selector` | String | `round_robin` | Datanode selector type.<br/>- `round_robin` (default value)<br/>- `lease_based`<br/>- `load_based`<br/>For details, please see "https://docs.greptime.com/developer-guide/metasrv/selector". |
| `use_memory_store` | Bool | `false` | Store data in memory. |
| `enable_region_failover` | Bool | `false` | Whether to enable region failover.<br/>This feature is only available on GreptimeDB running on cluster mode and<br/>- Using Remote WAL<br/>- Using shared storage (e.g., s3). |
| `region_failure_detector_initialization_delay` | String | `10m` | The delay before starting region failure detection.<br/>This delay helps prevent Metasrv from triggering unnecessary region failovers before all Datanodes are fully started.<br/>Especially useful when the cluster is not deployed with GreptimeDB Operator and maintenance mode is not enabled. |
| `allow_region_failover_on_local_wal` | Bool | `false` | Whether to allow region failover on local WAL.<br/>**This option is not recommended to be set to true, because it may lead to data loss during failover.** |

View File

@@ -6,9 +6,15 @@ default_timezone = "UTC"
## @toml2docs:none-default
default_column_prefix = "greptime"
## The maximum in-flight write bytes.
## Maximum total memory for all concurrent write request bodies and messages (HTTP, gRPC, Flight).
## Set to 0 to disable the limit. Default: "0" (unlimited)
## @toml2docs:none-default
#+ max_in_flight_write_bytes = "500MB"
#+ max_in_flight_write_bytes = "1GB"
## Policy when write bytes quota is exhausted.
## Options: "wait" (default, 10s timeout), "wait(<duration>)" (e.g., "wait(30s)"), "fail"
## @toml2docs:none-default
#+ write_bytes_exhausted_policy = "wait"
## The runtime options.
#+ [runtime]
@@ -35,10 +41,6 @@ timeout = "0s"
## The following units are supported: `B`, `KB`, `KiB`, `MB`, `MiB`, `GB`, `GiB`, `TB`, `TiB`, `PB`, `PiB`.
## Set to 0 to disable limit.
body_limit = "64MB"
## Maximum total memory for all concurrent HTTP request bodies.
## Set to 0 to disable the limit. Default: "0" (unlimited)
## @toml2docs:none-default
#+ max_total_body_memory = "1GB"
## HTTP CORS support, it's turned on by default
## This allows browser to access http APIs without CORS restrictions
enable_cors = true
@@ -62,10 +64,6 @@ bind_addr = "127.0.0.1:4001"
server_addr = "127.0.0.1:4001"
## The number of server worker threads.
runtime_size = 8
## Maximum total memory for all concurrent gRPC request messages.
## Set to 0 to disable the limit. Default: "0" (unlimited)
## @toml2docs:none-default
#+ max_total_message_memory = "1GB"
## Compression mode for frontend side Arrow IPC service. Available options:
## - `none`: disable all compression
## - `transport`: only enable gRPC transport compression (zstd)

View File

@@ -34,11 +34,18 @@ meta_table_name = "greptime_metakv"
## Optional PostgreSQL schema for metadata table and election table name qualification.
## When PostgreSQL public schema is not writable (e.g., PostgreSQL 15+ with restricted public),
## set this to a writable schema. GreptimeDB will use `meta_schema_name`.`meta_table_name`.
## GreptimeDB will NOT create the schema automatically; please ensure it exists or the user has permission.
## **Only used when backend is `postgres_store`.**
meta_schema_name = "greptime_schema"
## Automatically create PostgreSQL schema if it doesn't exist.
## When enabled, the system will execute `CREATE SCHEMA IF NOT EXISTS <schema_name>`
## before creating metadata tables. This is useful in production environments where
## manual schema creation may be restricted.
## Default is true.
## Note: The PostgreSQL user must have CREATE SCHEMA permission for this to work.
## **Only used when backend is `postgres_store`.**
auto_create_schema = true
## Advisory lock id in PostgreSQL for election. Effect when using PostgreSQL as kvbackend
## Only used when backend is `postgres_store`.
meta_election_lock_id = 1
@@ -50,9 +57,6 @@ meta_election_lock_id = 1
## For details, please see "https://docs.greptime.com/developer-guide/metasrv/selector".
selector = "round_robin"
## Store data in memory.
use_memory_store = false
## Whether to enable region failover.
## This feature is only available on GreptimeDB running on cluster mode and
## - Using Remote WAL

View File

@@ -6,6 +6,16 @@ default_timezone = "UTC"
## @toml2docs:none-default
default_column_prefix = "greptime"
## Maximum total memory for all concurrent write request bodies and messages (HTTP, gRPC, Flight).
## Set to 0 to disable the limit. Default: "0" (unlimited)
## @toml2docs:none-default
#+ max_in_flight_write_bytes = "1GB"
## Policy when write bytes quota is exhausted.
## Options: "wait" (default, 10s timeout), "wait(<duration>)" (e.g., "wait(30s)"), "fail"
## @toml2docs:none-default
#+ write_bytes_exhausted_policy = "wait"
## Initialize all regions in the background during the startup.
## By default, it provides services after all regions have been initialized.
init_regions_in_background = false
@@ -22,10 +32,6 @@ max_concurrent_queries = 0
## Enable telemetry to collect anonymous usage data. Enabled by default.
#+ enable_telemetry = true
## The maximum in-flight write bytes.
## @toml2docs:none-default
#+ max_in_flight_write_bytes = "500MB"
## The runtime options.
#+ [runtime]
## The number of threads to execute the runtime for global read operations.
@@ -43,10 +49,6 @@ timeout = "0s"
## The following units are supported: `B`, `KB`, `KiB`, `MB`, `MiB`, `GB`, `GiB`, `TB`, `TiB`, `PB`, `PiB`.
## Set to 0 to disable limit.
body_limit = "64MB"
## Maximum total memory for all concurrent HTTP request bodies.
## Set to 0 to disable the limit. Default: "0" (unlimited)
## @toml2docs:none-default
#+ max_total_body_memory = "1GB"
## HTTP CORS support, it's turned on by default
## This allows browser to access http APIs without CORS restrictions
enable_cors = true
@@ -67,10 +69,6 @@ prom_validation_mode = "strict"
bind_addr = "127.0.0.1:4001"
## The number of server worker threads.
runtime_size = 8
## Maximum total memory for all concurrent gRPC request messages.
## Set to 0 to disable the limit. Default: "0" (unlimited)
## @toml2docs:none-default
#+ max_total_message_memory = "1GB"
## The maximum connection age for gRPC connection.
## The value can be a human-readable time string. For example: `10m` for ten minutes or `1h` for one hour.
## Refer to https://grpc.io/docs/guides/keepalive/ for more details.

View File

@@ -0,0 +1,94 @@
---
Feature Name: Vector Index
Tracking Issue: TBD
Date: 2025-12-04
Author: "TBD"
---
# Summary
Introduce a per-SST approximate nearest neighbor (ANN) index for `VECTOR(dim)` columns with a pluggable engine. USearch HNSW is the initial engine, while the design keeps VSAG (default when linked) and future engines selectable at DDL or alter time and encoded in the index metadata. The index is built alongside SST creation and accelerates `ORDER BY vec_*_distance(column, <literal vector>) LIMIT k` queries, falling back to the existing brute-force path when an index is unavailable or ineligible.
# Motivation
Vector distances are currently computed with nalgebra across all rows (O(N)) before sorting, which does not scale to millions of vectors. An on-disk ANN index with sub-linear search reduces latency and compute cost for common RAG, semantic search, and recommendation workloads without changing SQL.
# Details
## Current Behavior
`VECTOR(dim)` values are stored as binary blobs. Queries call `vec_cos_distance`/`vec_l2sq_distance`/`vec_dot_product` via nalgebra for every row and then sort; there is no indexing or caching.
## Index Eligibility and Configuration
Only `VECTOR(dim)` columns can be indexed. A column metadata flag follows the existing column-option pattern with an intentionally small surface area:
- `engine`: `vsag` (default when the binding is built) or `usearch`. If a configured engine is unavailable at runtime, the builder logs and falls back to `usearch` while leaving the option intact for future rebuilds.
- `metric`: `cosine` (default), `l2sq`, or `dot`; mismatches with query functions force brute-force execution.
- `m`: HNSW graph connectivity (higher = denser graph, more memory, better recall), default `16`.
- `ef_construct`: build-time expansion, default `128`.
- `ef_search`: query-time expansion, default `64`; engines may clamp values.
Option semantics mirror HNSW defaults so both USearch and VSAG can honor them; engine-specific tunables stay in reserved key-value pairs inside the blob header for forward compatibility.
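For illustration, a minimal Rust sketch of these options and their documented defaults follows; the type and variant names (`VectorIndexOptions`, `VectorIndexEngine`, `DistanceMetric`) are hypothetical and only mirror the list above, not an existing API.
```rust
// Hypothetical model of the per-column options described in this RFC.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum VectorIndexEngine {
    Vsag,    // default when the VSAG binding is compiled in
    Usearch, // lighter-dependency fallback, always available
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DistanceMetric {
    Cosine, // default; must match the query's vec_*_distance function
    L2Sq,
    Dot,
}

#[derive(Debug, Clone)]
pub struct VectorIndexOptions {
    pub engine: VectorIndexEngine,
    pub metric: DistanceMetric,
    /// HNSW graph connectivity: higher = denser graph, more memory, better recall.
    pub m: u32,
    /// Build-time expansion factor.
    pub ef_construct: u32,
    /// Query-time expansion factor; engines may clamp it.
    pub ef_search: u32,
}

impl Default for VectorIndexOptions {
    fn default() -> Self {
        Self {
            engine: VectorIndexEngine::Vsag,
            metric: DistanceMetric::Cosine,
            m: 16,
            ef_construct: 128,
            ef_search: 64,
        }
    }
}
```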
DDL reuses column extensions similar to inverted/fulltext indexes:
```sql
CREATE TABLE embeddings (
ts TIMESTAMP TIME INDEX,
id STRING PRIMARY KEY,
vec VECTOR(384) VECTOR INDEX WITH (engine = 'vsag', metric = 'cosine', ef_search = 64)
);
```
Altering column options toggles the flag, can switch engines (for example `usearch` -> `vsag`), and triggers rebuilds through the existing alter/compaction flow. Engine choice stays in table metadata and each blob header; new SSTs use the configured engine while older SSTs remain readable under their recorded engine until compaction or a manual rebuild rewrites them.
## Storage and Format
- One vector index per indexed column per SST, stored as a Puffin blob with type `greptime-vector-index-v1`.
- Each blob records the engine (`usearch`, `vsag`, future values) and engine parameters in the header so readers can select the matching decoder. Mixed-engine SSTs remain readable because the engine id travels with the blob.
- USearch uses `f32` vectors and SST row offsets (`u64`) as keys; nulls and `OpType::Delete` rows are skipped. Row ids are the absolute SST ordinal so readers can derive `RowSelection` directly from parquet row group lengths without extra side tables.
- Blob layout:
- Header: version, column id, dimension, engine id, metric, `m`, `ef_construct`, `ef_search`, and reserved engine-specific key-value pairs.
- Counts: total rows written and indexed rows.
- Payload: USearch binary produced by `save_to_buffer`.
- An empty index (no eligible vectors) results in no available index entry for that column.
- `puffin_manager` registers the blob type so caches and readers discover it alongside inverted/fulltext/bloom blobs in the same index file.
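As a reading aid, the blob layout above can be pictured with the following hypothetical Rust structs; this is a sketch of the fields listed in this section, not the actual Puffin on-disk encoding.
```rust
// Hypothetical in-memory view of a `greptime-vector-index-v1` blob; the real
// encoding may differ, this only mirrors the fields listed above.
pub struct VectorIndexBlobHeader {
    pub format_version: u16,
    pub column_id: u32,
    pub dimension: u32,
    /// Engine identifier ("usearch", "vsag", ...); readers pick a decoder by it.
    pub engine: String,
    pub metric: String,
    pub m: u32,
    pub ef_construct: u32,
    pub ef_search: u32,
    /// Reserved engine-specific key-value pairs for forward compatibility.
    pub reserved: Vec<(String, String)>,
}

pub struct VectorIndexBlob {
    pub header: VectorIndexBlobHeader,
    /// Total rows written to the SST vs. rows actually inserted into the graph.
    pub total_rows: u64,
    pub indexed_rows: u64,
    /// Engine binary payload, e.g. the buffer produced by USearch's `save_to_buffer`.
    pub payload: Vec<u8>,
}
```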
## Row Visibility and Duplicates
- The indexer increments `row_offset` for every incoming row (including skipped/null/delete rows) so offsets stay aligned with parquet ordering across row groups.
- Only `OpType::Put` rows with the expected dimension are inserted; `OpType::Delete` and malformed rows are skipped but still advance `row_offset`, matching the data plane's visibility rules.
- Multiple versions of the same primary key remain in the graph; the read path intersects search hits with the standard mito2 deduplication/visibility pipeline (sequence-aware dedup, delete filtering, projection) before returning results.
- Searches overfetch beyond `k` to compensate for rows discarded by visibility checks and to avoid reissuing index reads.
## Build Path (mito2 write)
Extend `sst::index::Indexer` to optionally create a `VectorIndexer` when region metadata marks a column as vector-indexed, mirroring how inverted/fulltext/bloom filters attach to `IndexerBuilderImpl` in `mito2`.
The indexer consumes `Batch`/`RecordBatch` data and shares memory tracking and abort semantics with existing indexers:
- Maintain a running `row_offset` that follows SST write order and spans row groups so the search result can be turned into `RowSelection`.
- For each `OpType::Put`, if the vector is non-null and matches the declared dimension, insert into USearch with `row_offset` as the key; otherwise skip.
- Track memory with existing index build metrics; on failure, abort only the vector index while keeping SST writing unaffected.
Engine selection is table-driven: the builder picks the configured engine (default `vsag`, fallback `usearch` if `vsag` is not compiled in) and dispatches to the matching implementation. Unknown engines skip index build with a warning.
On `finish`, serialize the engine-tagged index into the Puffin writer and record `IndexType::Vector` metadata for the column. `IndexOutput` and `FileMeta::indexes/available_indexes` gain a vector entry so manifest updates and `RegionVersion` surface per-column presence, following patterns used by inverted/fulltext/bloom indexes. Planner/metadata validation ensures that mismatched dimensions only reduce the indexed-row count and do not break reads.
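The row-offset bookkeeping described above is the part that is easiest to get wrong, so here is a minimal sketch of the update loop using simplified stand-in types (`OpType`, `VectorIndexer`) rather than the real mito2 `Indexer` interfaces.
```rust
// Simplified stand-ins for the real mito2 types; names here are hypothetical.
#[derive(Clone, Copy)]
pub enum OpType {
    Put,
    Delete,
}

pub struct VectorIndexer {
    dimension: usize,
    /// Absolute SST row ordinal, advanced for every row (including skipped ones)
    /// so offsets stay aligned with parquet ordering across row groups.
    row_offset: u64,
    /// (row_offset, vector) pairs that would be fed to the ANN engine.
    pending: Vec<(u64, Vec<f32>)>,
}

impl VectorIndexer {
    pub fn new(dimension: usize) -> Self {
        Self { dimension, row_offset: 0, pending: Vec::new() }
    }

    /// Consumes one row in SST write order.
    pub fn update(&mut self, op: OpType, vector: Option<&[f32]>) {
        let key = self.row_offset;
        // Every row advances the offset, even when it is not indexed.
        self.row_offset += 1;
        match (op, vector) {
            // Only visible Put rows with the declared dimension are indexed.
            (OpType::Put, Some(v)) if v.len() == self.dimension => {
                self.pending.push((key, v.to_vec()));
            }
            // Deletes, nulls, and dimension mismatches are skipped but have
            // already consumed their row offset above.
            _ => {}
        }
    }
}
```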
## Read Path (mito2 query)
A planner rule in `query` identifies eligible plans on mito2 tables: a single `ORDER BY vec_cos_distance|vec_l2sq_distance|vec_dot_product(<vector column>, <literal vector>)` in ascending order plus a `LIMIT`/`TopK`. The rule rejects plans with multiple sort keys, non-literal query vectors, or additional projections that would change the distance expression and falls back to brute-force in those cases.
For eligible scans, build a `VectorIndexScan` execution node that:
- Consults SST metadata for `IndexType::Vector`, loads the index via Puffin using the existing `mito2::cache::index` infrastructure, and dispatches to the engine declared in the blob header (USearch/VSAG/etc.).
- Runs the engine's `search` with an overfetch (for example 2×k) to tolerate rows filtered by deletes, dimension mismatches, or late-stage dedup; keys already match SST row offsets produced by the writer.
- Converts hits to `RowSelection` using parquet row group lengths and reuses the parquet reader so visibility, projection, and deduplication logic stay unchanged; distances are recomputed with `vec_*_distance` before the final trim to k to guarantee ordering and to merge distributed partial results deterministically.
Any unsupported shape, load error, or cache miss falls back to the current brute-force execution path.
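The conversion from ANN hits to a row selection is essentially a run-length encoding over absolute row offsets. The following self-contained sketch uses a hypothetical `RowRun` type instead of parquet's `RowSelector`; in the real read path these runs would still pass through deduplication and visibility filtering, with distances recomputed before the final trim to k.
```rust
/// A run over consecutive SST rows: either skip them or read them. This mirrors
/// the shape of a parquet row selector without depending on the parquet crate.
#[derive(Debug, PartialEq)]
pub enum RowRun {
    Skip(u64),
    Select(u64),
}

/// Converts sorted, deduplicated hit offsets (absolute SST row ordinals) into
/// alternating skip/select runs covering all `total_rows` of the SST.
pub fn offsets_to_runs(sorted_hits: &[u64], total_rows: u64) -> Vec<RowRun> {
    let mut runs = Vec::new();
    let mut cursor = 0u64;
    for &hit in sorted_hits {
        if hit < cursor || hit >= total_rows {
            continue; // defensive: ignore duplicates and out-of-range hits
        }
        if hit > cursor {
            runs.push(RowRun::Skip(hit - cursor));
        }
        // Merge adjacent selected rows into one run.
        match runs.last_mut() {
            Some(RowRun::Select(n)) => *n += 1,
            _ => runs.push(RowRun::Select(1)),
        }
        cursor = hit + 1;
    }
    if cursor < total_rows {
        runs.push(RowRun::Skip(total_rows - cursor));
    }
    runs
}
```
For example, hits `[2, 3, 7]` over a 10-row SST become `[Skip(2), Select(2), Skip(3), Select(1), Skip(2)]`.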
## Lifecycle and Maintenance
Lifecycle piggybacks on the existing SST/index flow: rebuilds run where other secondary indexes do, graphs are always rebuilt from source rows (no HNSW merge), and cleanup/versioning/caching reuse the existing Puffin and index cache paths.
# Implementation Plan
1. Add the `usearch` dependency (wrapper module in `index` or `mito2`) and map minimal HNSW options; keep an engine trait that allows plugging VSAG without changing the rest of the pipeline.
2. Introduce `IndexType::Vector` and a column metadata key for vector index options (including `engine`); add SQL parser and `SHOW CREATE TABLE` support for `VECTOR INDEX WITH (...)`.
3. Implement `vector_index` build/read modules under `mito2` (and `index` if shared), including Puffin serialization that records engine id, blob-type registration with `puffin_manager`, and integration with the `Indexer` builder, `IndexOutput`, manifest updates, and compaction rebuild.
4. Extend the query planner/execution to detect eligible plans and drive a `RowSelection`-based ANN scan with a fallback path, dispatching by engine at read time and using existing Puffin and index caches.
5. Add unit tests for serialization/search correctness and an end-to-end test covering plan rewrite, cache usage, engine selection, and fallback; add a mixed-engine test to confirm old USearch blobs still serve after a VSAG switch.
6. Follow up with an optional VSAG engine binding (feature flag), validate parity with USearch on dense vectors, exercise alternative algorithms (for example PQ), and flip the default `engine` to `vsag` when the binding is present.
# Alternatives
- **VSAG (follow-up engine):** C++ library with HNSW and additional algorithms (for example SINDI for sparse vectors and PQ) targeting in-memory and disk-friendly search. Provides parameter generators and a roadmap for GPU-assisted build and graph compression. Compared to FAISS it is newer with fewer integrations but bundles sparse/dense coverage and out-of-core focus in one engine. Fits the pluggable-engine design and would become the default `engine = 'vsag'` when linked; USearch remains available for lighter dependencies.
- **FAISS:** Broad feature set (IVF/IVFPQ/PQ/HNSW, GPU acceleration, scalar filtering, pre/post filters) and battle-tested performance across datasets, but it requires a heavier C++/GPU toolchain, has no official Rust binding, and is less disk-centric than VSAG; integrating it would add more build/distribution burden than USearch/VSAG.
- **Do nothing:** Keep brute-force evaluation, which remains O(N) and unacceptable at scale.

View File

@@ -61,6 +61,12 @@ pub struct StoreConfig {
#[cfg(feature = "pg_kvbackend")]
#[clap(long)]
pub meta_schema_name: Option<String>,
/// Automatically create PostgreSQL schema if it doesn't exist (default: true).
#[cfg(feature = "pg_kvbackend")]
#[clap(long, default_value_t = true)]
pub auto_create_schema: bool,
/// TLS mode for backend store connections (etcd, PostgreSQL, MySQL)
#[clap(long = "backend-tls-mode", value_enum, default_value = "disable")]
pub backend_tls_mode: TlsMode,
@@ -138,6 +144,7 @@ impl StoreConfig {
schema_name,
table_name,
max_txn_ops,
self.auto_create_schema,
)
.await
.map_err(BoxedError::new)?)

View File

@@ -155,8 +155,6 @@ pub struct StartCommand {
#[clap(short, long)]
selector: Option<String>,
#[clap(long)]
use_memory_store: Option<bool>,
#[clap(long)]
enable_region_failover: Option<bool>,
#[clap(long)]
http_addr: Option<String>,
@@ -186,7 +184,6 @@ impl Debug for StartCommand {
.field("store_addrs", &self.sanitize_store_addrs())
.field("config_file", &self.config_file)
.field("selector", &self.selector)
.field("use_memory_store", &self.use_memory_store)
.field("enable_region_failover", &self.enable_region_failover)
.field("http_addr", &self.http_addr)
.field("http_timeout", &self.http_timeout)
@@ -268,10 +265,6 @@ impl StartCommand {
.context(error::UnsupportedSelectorTypeSnafu { selector_type })?;
}
if let Some(use_memory_store) = self.use_memory_store {
opts.use_memory_store = use_memory_store;
}
if let Some(enable_region_failover) = self.enable_region_failover {
opts.enable_region_failover = enable_region_failover;
}
@@ -391,7 +384,6 @@ mod tests {
server_addr = "127.0.0.1:3002"
store_addr = "127.0.0.1:2379"
selector = "LeaseBased"
use_memory_store = false
[logging]
level = "debug"
@@ -470,7 +462,6 @@ mod tests {
server_addr = "127.0.0.1:3002"
datanode_lease_secs = 15
selector = "LeaseBased"
use_memory_store = false
[http]
addr = "127.0.0.1:4000"

View File

@@ -14,13 +14,31 @@
//! String scalar functions
mod elt;
mod field;
mod format;
mod insert;
mod locate;
mod regexp_extract;
mod space;
pub(crate) use elt::EltFunction;
pub(crate) use field::FieldFunction;
pub(crate) use format::FormatFunction;
pub(crate) use insert::InsertFunction;
pub(crate) use locate::LocateFunction;
pub(crate) use regexp_extract::RegexpExtractFunction;
pub(crate) use space::SpaceFunction;
use crate::function_registry::FunctionRegistry;
/// Register all string functions
pub fn register_string_functions(registry: &FunctionRegistry) {
EltFunction::register(registry);
FieldFunction::register(registry);
FormatFunction::register(registry);
InsertFunction::register(registry);
LocateFunction::register(registry);
RegexpExtractFunction::register(registry);
SpaceFunction::register(registry);
}

View File

@@ -0,0 +1,252 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//! MySQL-compatible ELT function implementation.
//!
//! ELT(N, str1, str2, str3, ...) - Returns the Nth string from the list.
//! Returns NULL if N < 1 or N > number of strings.
use std::fmt;
use std::sync::Arc;
use datafusion_common::DataFusionError;
use datafusion_common::arrow::array::{Array, ArrayRef, AsArray, LargeStringBuilder};
use datafusion_common::arrow::compute::cast;
use datafusion_common::arrow::datatypes::DataType;
use datafusion_expr::{ColumnarValue, ScalarFunctionArgs, Signature, Volatility};
use crate::function::Function;
use crate::function_registry::FunctionRegistry;
const NAME: &str = "elt";
/// MySQL-compatible ELT function.
///
/// Syntax: ELT(N, str1, str2, str3, ...)
/// Returns the Nth string argument. N is 1-based.
/// Returns NULL if N is NULL, N < 1, or N > number of string arguments.
#[derive(Debug)]
pub struct EltFunction {
signature: Signature,
}
impl EltFunction {
pub fn register(registry: &FunctionRegistry) {
registry.register_scalar(EltFunction::default());
}
}
impl Default for EltFunction {
fn default() -> Self {
Self {
// ELT takes a variable number of arguments: (Int64, String, String, ...)
signature: Signature::variadic_any(Volatility::Immutable),
}
}
}
impl fmt::Display for EltFunction {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
write!(f, "{}", NAME.to_ascii_uppercase())
}
}
impl Function for EltFunction {
fn name(&self) -> &str {
NAME
}
fn return_type(&self, _: &[DataType]) -> datafusion_common::Result<DataType> {
Ok(DataType::LargeUtf8)
}
fn signature(&self) -> &Signature {
&self.signature
}
fn invoke_with_args(
&self,
args: ScalarFunctionArgs,
) -> datafusion_common::Result<ColumnarValue> {
if args.args.len() < 2 {
return Err(DataFusionError::Execution(
"ELT requires at least 2 arguments: ELT(N, str1, ...)".to_string(),
));
}
let arrays = ColumnarValue::values_to_arrays(&args.args)?;
let len = arrays[0].len();
let num_strings = arrays.len() - 1;
// First argument is the index (N) - try to cast to Int64
let index_array = if arrays[0].data_type() == &DataType::Null {
// All NULLs - return all NULLs
let mut builder = LargeStringBuilder::with_capacity(len, 0);
for _ in 0..len {
builder.append_null();
}
return Ok(ColumnarValue::Array(Arc::new(builder.finish())));
} else {
cast(arrays[0].as_ref(), &DataType::Int64).map_err(|e| {
DataFusionError::Execution(format!("ELT: index argument cast failed: {}", e))
})?
};
// Cast string arguments to LargeUtf8
let string_arrays: Vec<ArrayRef> = arrays[1..]
.iter()
.enumerate()
.map(|(i, arr)| {
cast(arr.as_ref(), &DataType::LargeUtf8).map_err(|e| {
DataFusionError::Execution(format!(
"ELT: string argument {} cast failed: {}",
i + 1,
e
))
})
})
.collect::<datafusion_common::Result<Vec<_>>>()?;
let mut builder = LargeStringBuilder::with_capacity(len, len * 32);
for i in 0..len {
if index_array.is_null(i) {
builder.append_null();
continue;
}
let n = index_array
.as_primitive::<datafusion_common::arrow::datatypes::Int64Type>()
.value(i);
// N is 1-based, check bounds
if n < 1 || n as usize > num_strings {
builder.append_null();
continue;
}
let str_idx = (n - 1) as usize;
let str_array = string_arrays[str_idx].as_string::<i64>();
if str_array.is_null(i) {
builder.append_null();
} else {
builder.append_value(str_array.value(i));
}
}
Ok(ColumnarValue::Array(Arc::new(builder.finish())))
}
}
#[cfg(test)]
mod tests {
use std::sync::Arc;
use datafusion_common::arrow::array::{Int64Array, StringArray};
use datafusion_common::arrow::datatypes::Field;
use datafusion_expr::ScalarFunctionArgs;
use super::*;
fn create_args(arrays: Vec<ArrayRef>) -> ScalarFunctionArgs {
let arg_fields: Vec<_> = arrays
.iter()
.enumerate()
.map(|(i, arr)| {
Arc::new(Field::new(
format!("arg_{}", i),
arr.data_type().clone(),
true,
))
})
.collect();
ScalarFunctionArgs {
args: arrays.iter().cloned().map(ColumnarValue::Array).collect(),
arg_fields,
return_field: Arc::new(Field::new("result", DataType::LargeUtf8, true)),
number_rows: arrays[0].len(),
config_options: Arc::new(datafusion_common::config::ConfigOptions::default()),
}
}
#[test]
fn test_elt_basic() {
let function = EltFunction::default();
let n = Arc::new(Int64Array::from(vec![1, 2, 3]));
let s1 = Arc::new(StringArray::from(vec!["a", "a", "a"]));
let s2 = Arc::new(StringArray::from(vec!["b", "b", "b"]));
let s3 = Arc::new(StringArray::from(vec!["c", "c", "c"]));
let args = create_args(vec![n, s1, s2, s3]);
let result = function.invoke_with_args(args).unwrap();
if let ColumnarValue::Array(array) = result {
let str_array = array.as_string::<i64>();
assert_eq!(str_array.value(0), "a");
assert_eq!(str_array.value(1), "b");
assert_eq!(str_array.value(2), "c");
} else {
panic!("Expected array result");
}
}
#[test]
fn test_elt_out_of_bounds() {
let function = EltFunction::default();
let n = Arc::new(Int64Array::from(vec![0, 4, -1]));
let s1 = Arc::new(StringArray::from(vec!["a", "a", "a"]));
let s2 = Arc::new(StringArray::from(vec!["b", "b", "b"]));
let s3 = Arc::new(StringArray::from(vec!["c", "c", "c"]));
let args = create_args(vec![n, s1, s2, s3]);
let result = function.invoke_with_args(args).unwrap();
if let ColumnarValue::Array(array) = result {
let str_array = array.as_string::<i64>();
assert!(str_array.is_null(0)); // 0 is out of bounds
assert!(str_array.is_null(1)); // 4 is out of bounds
assert!(str_array.is_null(2)); // -1 is out of bounds
} else {
panic!("Expected array result");
}
}
#[test]
fn test_elt_with_nulls() {
let function = EltFunction::default();
// Row 0: n=1, select s1="a" -> "a"
// Row 1: n=NULL -> NULL
// Row 2: n=1, select s1=NULL -> NULL
let n = Arc::new(Int64Array::from(vec![Some(1), None, Some(1)]));
let s1 = Arc::new(StringArray::from(vec![Some("a"), Some("a"), None]));
let s2 = Arc::new(StringArray::from(vec![Some("b"), Some("b"), Some("b")]));
let args = create_args(vec![n, s1, s2]);
let result = function.invoke_with_args(args).unwrap();
if let ColumnarValue::Array(array) = result {
let str_array = array.as_string::<i64>();
assert_eq!(str_array.value(0), "a");
assert!(str_array.is_null(1)); // N is NULL
assert!(str_array.is_null(2)); // Selected string is NULL
} else {
panic!("Expected array result");
}
}
}

View File

@@ -0,0 +1,224 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//! MySQL-compatible FIELD function implementation.
//!
//! FIELD(str, str1, str2, str3, ...) - Returns the 1-based index of str in the list.
//! Returns 0 if str is not found or is NULL.
use std::fmt;
use std::sync::Arc;
use datafusion_common::DataFusionError;
use datafusion_common::arrow::array::{Array, ArrayRef, AsArray, Int64Builder};
use datafusion_common::arrow::compute::cast;
use datafusion_common::arrow::datatypes::DataType;
use datafusion_expr::{ColumnarValue, ScalarFunctionArgs, Signature, Volatility};
use crate::function::Function;
use crate::function_registry::FunctionRegistry;
const NAME: &str = "field";
/// MySQL-compatible FIELD function.
///
/// Syntax: FIELD(str, str1, str2, str3, ...)
/// Returns the 1-based index of str in the argument list (str1, str2, str3, ...).
/// Returns 0 if str is not found or is NULL.
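///
/// Illustrative examples (mirroring the unit tests below):
/// - `FIELD('b', 'a', 'b', 'c')` returns 2.
/// - `FIELD('d', 'a', 'b', 'c')` returns 0 (not found).
/// - `FIELD(NULL, 'a', 'b')` returns 0.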
#[derive(Debug)]
pub struct FieldFunction {
signature: Signature,
}
impl FieldFunction {
pub fn register(registry: &FunctionRegistry) {
registry.register_scalar(FieldFunction::default());
}
}
impl Default for FieldFunction {
fn default() -> Self {
Self {
// FIELD takes a variable number of arguments: (String, String, String, ...)
signature: Signature::variadic_any(Volatility::Immutable),
}
}
}
impl fmt::Display for FieldFunction {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
write!(f, "{}", NAME.to_ascii_uppercase())
}
}
impl Function for FieldFunction {
fn name(&self) -> &str {
NAME
}
fn return_type(&self, _: &[DataType]) -> datafusion_common::Result<DataType> {
Ok(DataType::Int64)
}
fn signature(&self) -> &Signature {
&self.signature
}
fn invoke_with_args(
&self,
args: ScalarFunctionArgs,
) -> datafusion_common::Result<ColumnarValue> {
if args.args.len() < 2 {
return Err(DataFusionError::Execution(
"FIELD requires at least 2 arguments: FIELD(str, str1, ...)".to_string(),
));
}
let arrays = ColumnarValue::values_to_arrays(&args.args)?;
let len = arrays[0].len();
// Cast all arguments to LargeUtf8
let string_arrays: Vec<ArrayRef> = arrays
.iter()
.enumerate()
.map(|(i, arr)| {
cast(arr.as_ref(), &DataType::LargeUtf8).map_err(|e| {
DataFusionError::Execution(format!("FIELD: argument {} cast failed: {}", i, e))
})
})
.collect::<datafusion_common::Result<Vec<_>>>()?;
let search_str = string_arrays[0].as_string::<i64>();
let mut builder = Int64Builder::with_capacity(len);
for i in 0..len {
// If search string is NULL, return 0
if search_str.is_null(i) {
builder.append_value(0);
continue;
}
let needle = search_str.value(i);
let mut found_idx = 0i64;
// Search through the list (starting from index 1 in string_arrays)
for (j, str_arr) in string_arrays[1..].iter().enumerate() {
let str_array = str_arr.as_string::<i64>();
if !str_array.is_null(i) && str_array.value(i) == needle {
found_idx = (j + 1) as i64; // 1-based index
break;
}
}
builder.append_value(found_idx);
}
Ok(ColumnarValue::Array(Arc::new(builder.finish())))
}
}
#[cfg(test)]
mod tests {
use std::sync::Arc;
use datafusion_common::arrow::array::StringArray;
use datafusion_common::arrow::datatypes::Field;
use datafusion_expr::ScalarFunctionArgs;
use super::*;
fn create_args(arrays: Vec<ArrayRef>) -> ScalarFunctionArgs {
let arg_fields: Vec<_> = arrays
.iter()
.enumerate()
.map(|(i, arr)| {
Arc::new(Field::new(
format!("arg_{}", i),
arr.data_type().clone(),
true,
))
})
.collect();
ScalarFunctionArgs {
args: arrays.iter().cloned().map(ColumnarValue::Array).collect(),
arg_fields,
return_field: Arc::new(Field::new("result", DataType::Int64, true)),
number_rows: arrays[0].len(),
config_options: Arc::new(datafusion_common::config::ConfigOptions::default()),
}
}
#[test]
fn test_field_basic() {
let function = FieldFunction::default();
let search = Arc::new(StringArray::from(vec!["b", "d", "a"]));
let s1 = Arc::new(StringArray::from(vec!["a", "a", "a"]));
let s2 = Arc::new(StringArray::from(vec!["b", "b", "b"]));
let s3 = Arc::new(StringArray::from(vec!["c", "c", "c"]));
let args = create_args(vec![search, s1, s2, s3]);
let result = function.invoke_with_args(args).unwrap();
if let ColumnarValue::Array(array) = result {
let int_array = array.as_primitive::<datafusion_common::arrow::datatypes::Int64Type>();
assert_eq!(int_array.value(0), 2); // "b" is at index 2
assert_eq!(int_array.value(1), 0); // "d" not found
assert_eq!(int_array.value(2), 1); // "a" is at index 1
} else {
panic!("Expected array result");
}
}
#[test]
fn test_field_with_null_search() {
let function = FieldFunction::default();
let search = Arc::new(StringArray::from(vec![Some("a"), None]));
let s1 = Arc::new(StringArray::from(vec!["a", "a"]));
let s2 = Arc::new(StringArray::from(vec!["b", "b"]));
let args = create_args(vec![search, s1, s2]);
let result = function.invoke_with_args(args).unwrap();
if let ColumnarValue::Array(array) = result {
let int_array = array.as_primitive::<datafusion_common::arrow::datatypes::Int64Type>();
assert_eq!(int_array.value(0), 1); // "a" found at index 1
assert_eq!(int_array.value(1), 0); // NULL search returns 0
} else {
panic!("Expected array result");
}
}
#[test]
fn test_field_case_sensitive() {
let function = FieldFunction::default();
let search = Arc::new(StringArray::from(vec!["A", "a"]));
let s1 = Arc::new(StringArray::from(vec!["a", "a"]));
let s2 = Arc::new(StringArray::from(vec!["A", "A"]));
let args = create_args(vec![search, s1, s2]);
let result = function.invoke_with_args(args).unwrap();
if let ColumnarValue::Array(array) = result {
let int_array = array.as_primitive::<datafusion_common::arrow::datatypes::Int64Type>();
assert_eq!(int_array.value(0), 2); // "A" matches at index 2
assert_eq!(int_array.value(1), 1); // "a" matches at index 1
} else {
panic!("Expected array result");
}
}
}

View File

@@ -0,0 +1,512 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//! MySQL-compatible FORMAT function implementation.
//!
//! FORMAT(X, D) - Formats the number X with D decimal places using thousand separators.
use std::fmt;
use std::sync::Arc;
use datafusion_common::DataFusionError;
use datafusion_common::arrow::array::{Array, AsArray, LargeStringBuilder};
use datafusion_common::arrow::datatypes as arrow_types;
use datafusion_common::arrow::datatypes::DataType;
use datafusion_expr::{ColumnarValue, ScalarFunctionArgs, Signature, TypeSignature, Volatility};
use crate::function::Function;
use crate::function_registry::FunctionRegistry;
const NAME: &str = "format";
/// MySQL-compatible FORMAT function.
///
/// Syntax: FORMAT(X, D)
/// Formats the number X to a format like '#,###,###.##', rounded to D decimal places.
/// D can be 0 to 30.
///
/// Note: This implementation uses the en_US locale (comma as thousand separator,
/// period as decimal separator).
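///
/// Illustrative examples (mirroring the unit tests below):
/// - `FORMAT(1234567.891, 2)` returns `'1,234,567.89'`.
/// - `FORMAT(1234.5, 0)` returns `'1,235'` (rounded).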
#[derive(Debug)]
pub struct FormatFunction {
signature: Signature,
}
impl FormatFunction {
pub fn register(registry: &FunctionRegistry) {
registry.register_scalar(FormatFunction::default());
}
}
impl Default for FormatFunction {
fn default() -> Self {
let mut signatures = Vec::new();
// Support various numeric types for X
let numeric_types = [
DataType::Float64,
DataType::Float32,
DataType::Int64,
DataType::Int32,
DataType::Int16,
DataType::Int8,
DataType::UInt64,
DataType::UInt32,
DataType::UInt16,
DataType::UInt8,
];
// D can be various integer types
let int_types = [
DataType::Int64,
DataType::Int32,
DataType::Int16,
DataType::Int8,
DataType::UInt64,
DataType::UInt32,
DataType::UInt16,
DataType::UInt8,
];
for x_type in &numeric_types {
for d_type in &int_types {
signatures.push(TypeSignature::Exact(vec![x_type.clone(), d_type.clone()]));
}
}
Self {
signature: Signature::one_of(signatures, Volatility::Immutable),
}
}
}
impl fmt::Display for FormatFunction {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
write!(f, "{}", NAME.to_ascii_uppercase())
}
}
impl Function for FormatFunction {
fn name(&self) -> &str {
NAME
}
fn return_type(&self, _: &[DataType]) -> datafusion_common::Result<DataType> {
Ok(DataType::LargeUtf8)
}
fn signature(&self) -> &Signature {
&self.signature
}
fn invoke_with_args(
&self,
args: ScalarFunctionArgs,
) -> datafusion_common::Result<ColumnarValue> {
if args.args.len() != 2 {
return Err(DataFusionError::Execution(
"FORMAT requires exactly 2 arguments: FORMAT(X, D)".to_string(),
));
}
let arrays = ColumnarValue::values_to_arrays(&args.args)?;
let len = arrays[0].len();
let x_array = &arrays[0];
let d_array = &arrays[1];
let mut builder = LargeStringBuilder::with_capacity(len, len * 20);
for i in 0..len {
if x_array.is_null(i) || d_array.is_null(i) {
builder.append_null();
continue;
}
let decimal_places = get_decimal_places(d_array, i)?.clamp(0, 30) as usize;
let formatted = match x_array.data_type() {
DataType::Float64 | DataType::Float32 => {
format_number_float(get_float_value(x_array, i)?, decimal_places)
}
DataType::Int64
| DataType::Int32
| DataType::Int16
| DataType::Int8
| DataType::UInt64
| DataType::UInt32
| DataType::UInt16
| DataType::UInt8 => format_number_integer(x_array, i, decimal_places)?,
_ => {
return Err(DataFusionError::Execution(format!(
"FORMAT: unsupported type {:?}",
x_array.data_type()
)));
}
};
builder.append_value(&formatted);
}
Ok(ColumnarValue::Array(Arc::new(builder.finish())))
}
}
/// Get float value from various numeric types.
fn get_float_value(
array: &datafusion_common::arrow::array::ArrayRef,
index: usize,
) -> datafusion_common::Result<f64> {
match array.data_type() {
DataType::Float64 => Ok(array
.as_primitive::<arrow_types::Float64Type>()
.value(index)),
DataType::Float32 => Ok(array
.as_primitive::<arrow_types::Float32Type>()
.value(index) as f64),
_ => Err(DataFusionError::Execution(format!(
"FORMAT: unsupported type {:?}",
array.data_type()
))),
}
}
/// Get decimal places from various integer types.
///
/// MySQL clamps decimal places to `0..=30`. This function returns an `i64` so the caller can clamp.
fn get_decimal_places(
array: &datafusion_common::arrow::array::ArrayRef,
index: usize,
) -> datafusion_common::Result<i64> {
match array.data_type() {
DataType::Int64 => Ok(array.as_primitive::<arrow_types::Int64Type>().value(index)),
DataType::Int32 => Ok(array.as_primitive::<arrow_types::Int32Type>().value(index) as i64),
DataType::Int16 => Ok(array.as_primitive::<arrow_types::Int16Type>().value(index) as i64),
DataType::Int8 => Ok(array.as_primitive::<arrow_types::Int8Type>().value(index) as i64),
DataType::UInt64 => {
let v = array.as_primitive::<arrow_types::UInt64Type>().value(index);
Ok(if v > i64::MAX as u64 {
i64::MAX
} else {
v as i64
})
}
DataType::UInt32 => Ok(array.as_primitive::<arrow_types::UInt32Type>().value(index) as i64),
DataType::UInt16 => Ok(array.as_primitive::<arrow_types::UInt16Type>().value(index) as i64),
DataType::UInt8 => Ok(array.as_primitive::<arrow_types::UInt8Type>().value(index) as i64),
_ => Err(DataFusionError::Execution(format!(
"FORMAT: unsupported type {:?}",
array.data_type()
))),
}
}
fn format_number_integer(
array: &datafusion_common::arrow::array::ArrayRef,
index: usize,
decimal_places: usize,
) -> datafusion_common::Result<String> {
let (is_negative, abs_digits) = match array.data_type() {
DataType::Int64 => {
let v = array.as_primitive::<arrow_types::Int64Type>().value(index) as i128;
(v.is_negative(), v.unsigned_abs().to_string())
}
DataType::Int32 => {
let v = array.as_primitive::<arrow_types::Int32Type>().value(index) as i128;
(v.is_negative(), v.unsigned_abs().to_string())
}
DataType::Int16 => {
let v = array.as_primitive::<arrow_types::Int16Type>().value(index) as i128;
(v.is_negative(), v.unsigned_abs().to_string())
}
DataType::Int8 => {
let v = array.as_primitive::<arrow_types::Int8Type>().value(index) as i128;
(v.is_negative(), v.unsigned_abs().to_string())
}
DataType::UInt64 => {
let v = array.as_primitive::<arrow_types::UInt64Type>().value(index) as u128;
(false, v.to_string())
}
DataType::UInt32 => {
let v = array.as_primitive::<arrow_types::UInt32Type>().value(index) as u128;
(false, v.to_string())
}
DataType::UInt16 => {
let v = array.as_primitive::<arrow_types::UInt16Type>().value(index) as u128;
(false, v.to_string())
}
DataType::UInt8 => {
let v = array.as_primitive::<arrow_types::UInt8Type>().value(index) as u128;
(false, v.to_string())
}
_ => {
return Err(DataFusionError::Execution(format!(
"FORMAT: unsupported type {:?}",
array.data_type()
)));
}
};
let mut result = String::new();
if is_negative {
result.push('-');
}
result.push_str(&add_thousand_separators(&abs_digits));
if decimal_places > 0 {
result.push('.');
result.push_str(&"0".repeat(decimal_places));
}
Ok(result)
}
/// Format a float with thousand separators and `decimal_places` digits after decimal point.
fn format_number_float(x: f64, decimal_places: usize) -> String {
// Handle special cases
if x.is_nan() {
return "NaN".to_string();
}
if x.is_infinite() {
return if x.is_sign_positive() {
"Infinity".to_string()
} else {
"-Infinity".to_string()
};
}
// Round to decimal_places
let multiplier = 10f64.powi(decimal_places as i32);
let rounded = (x * multiplier).round() / multiplier;
// Split into integer and fractional parts
let is_negative = rounded < 0.0;
let abs_value = rounded.abs();
// Format with the specified decimal places
let formatted = if decimal_places == 0 {
format!("{:.0}", abs_value)
} else {
format!("{:.prec$}", abs_value, prec = decimal_places)
};
// Split at decimal point
let parts: Vec<&str> = formatted.split('.').collect();
let int_part = parts[0];
let dec_part = parts.get(1).copied();
// Add thousand separators to integer part
let int_with_sep = add_thousand_separators(int_part);
// Build result
let mut result = String::new();
if is_negative {
result.push('-');
}
result.push_str(&int_with_sep);
if let Some(dec) = dec_part {
result.push('.');
result.push_str(dec);
}
result
}
/// Add thousand separators (commas) to an integer string.
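/// Illustrative: `add_thousand_separators("1234567")` returns `"1,234,567"`;
/// inputs of three or fewer digits are returned unchanged.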
fn add_thousand_separators(s: &str) -> String {
let chars: Vec<char> = s.chars().collect();
let len = chars.len();
if len <= 3 {
return s.to_string();
}
let mut result = String::with_capacity(len + len / 3);
let first_group_len = len % 3;
let first_group_len = if first_group_len == 0 {
3
} else {
first_group_len
};
for (i, ch) in chars.iter().enumerate() {
if i > 0 && i >= first_group_len && (i - first_group_len) % 3 == 0 {
result.push(',');
}
result.push(*ch);
}
result
}
#[cfg(test)]
mod tests {
use std::sync::Arc;
use datafusion_common::arrow::array::{Float64Array, Int64Array};
use datafusion_common::arrow::datatypes::Field;
use datafusion_expr::ScalarFunctionArgs;
use super::*;
fn create_args(arrays: Vec<datafusion_common::arrow::array::ArrayRef>) -> ScalarFunctionArgs {
let arg_fields: Vec<_> = arrays
.iter()
.enumerate()
.map(|(i, arr)| {
Arc::new(Field::new(
format!("arg_{}", i),
arr.data_type().clone(),
true,
))
})
.collect();
ScalarFunctionArgs {
args: arrays.iter().cloned().map(ColumnarValue::Array).collect(),
arg_fields,
return_field: Arc::new(Field::new("result", DataType::LargeUtf8, true)),
number_rows: arrays[0].len(),
config_options: Arc::new(datafusion_common::config::ConfigOptions::default()),
}
}
#[test]
fn test_format_basic() {
let function = FormatFunction::default();
let x = Arc::new(Float64Array::from(vec![1234567.891, 1234.5, 1234567.0]));
let d = Arc::new(Int64Array::from(vec![2, 0, 3]));
let args = create_args(vec![x, d]);
let result = function.invoke_with_args(args).unwrap();
if let ColumnarValue::Array(array) = result {
let str_array = array.as_string::<i64>();
assert_eq!(str_array.value(0), "1,234,567.89");
assert_eq!(str_array.value(1), "1,235"); // rounded
assert_eq!(str_array.value(2), "1,234,567.000");
} else {
panic!("Expected array result");
}
}
#[test]
fn test_format_negative() {
let function = FormatFunction::default();
let x = Arc::new(Float64Array::from(vec![-1234567.891]));
let d = Arc::new(Int64Array::from(vec![2]));
let args = create_args(vec![x, d]);
let result = function.invoke_with_args(args).unwrap();
if let ColumnarValue::Array(array) = result {
let str_array = array.as_string::<i64>();
assert_eq!(str_array.value(0), "-1,234,567.89");
} else {
panic!("Expected array result");
}
}
#[test]
fn test_format_small_numbers() {
let function = FormatFunction::default();
let x = Arc::new(Float64Array::from(vec![0.5, 12.345, 123.0]));
let d = Arc::new(Int64Array::from(vec![2, 2, 0]));
let args = create_args(vec![x, d]);
let result = function.invoke_with_args(args).unwrap();
if let ColumnarValue::Array(array) = result {
let str_array = array.as_string::<i64>();
assert_eq!(str_array.value(0), "0.50");
assert_eq!(str_array.value(1), "12.35"); // rounded
assert_eq!(str_array.value(2), "123");
} else {
panic!("Expected array result");
}
}
#[test]
fn test_format_with_nulls() {
let function = FormatFunction::default();
let x = Arc::new(Float64Array::from(vec![Some(1234.5), None]));
let d = Arc::new(Int64Array::from(vec![2, 2]));
let args = create_args(vec![x, d]);
let result = function.invoke_with_args(args).unwrap();
if let ColumnarValue::Array(array) = result {
let str_array = array.as_string::<i64>();
assert_eq!(str_array.value(0), "1,234.50");
assert!(str_array.is_null(1));
} else {
panic!("Expected array result");
}
}
#[test]
fn test_add_thousand_separators() {
assert_eq!(add_thousand_separators("1"), "1");
assert_eq!(add_thousand_separators("12"), "12");
assert_eq!(add_thousand_separators("123"), "123");
assert_eq!(add_thousand_separators("1234"), "1,234");
assert_eq!(add_thousand_separators("12345"), "12,345");
assert_eq!(add_thousand_separators("123456"), "123,456");
assert_eq!(add_thousand_separators("1234567"), "1,234,567");
assert_eq!(add_thousand_separators("12345678"), "12,345,678");
assert_eq!(add_thousand_separators("123456789"), "123,456,789");
}
#[test]
fn test_format_large_int_no_float_precision_loss() {
let function = FormatFunction::default();
// 2^53 + 1 cannot be represented exactly as f64.
let x = Arc::new(Int64Array::from(vec![9_007_199_254_740_993i64]));
let d = Arc::new(Int64Array::from(vec![0]));
let args = create_args(vec![x, d]);
let result = function.invoke_with_args(args).unwrap();
if let ColumnarValue::Array(array) = result {
let str_array = array.as_string::<i64>();
assert_eq!(str_array.value(0), "9,007,199,254,740,993");
} else {
panic!("Expected array result");
}
}
#[test]
fn test_format_decimal_places_u64_overflow_clamps() {
use datafusion_common::arrow::array::UInt64Array;
let function = FormatFunction::default();
let x = Arc::new(Int64Array::from(vec![1]));
let d = Arc::new(UInt64Array::from(vec![u64::MAX]));
let args = create_args(vec![x, d]);
let result = function.invoke_with_args(args).unwrap();
if let ColumnarValue::Array(array) = result {
let str_array = array.as_string::<i64>();
assert_eq!(str_array.value(0), format!("1.{}", "0".repeat(30)));
} else {
panic!("Expected array result");
}
}
}

View File

@@ -0,0 +1,345 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//! MySQL-compatible INSERT function implementation.
//!
//! INSERT(str, pos, len, newstr) - Inserts newstr into str at position pos,
//! replacing len characters.
use std::fmt;
use std::sync::Arc;
use datafusion_common::DataFusionError;
use datafusion_common::arrow::array::{Array, ArrayRef, AsArray, LargeStringBuilder};
use datafusion_common::arrow::compute::cast;
use datafusion_common::arrow::datatypes::DataType;
use datafusion_expr::{ColumnarValue, ScalarFunctionArgs, Signature, TypeSignature, Volatility};
use crate::function::Function;
use crate::function_registry::FunctionRegistry;
const NAME: &str = "insert";
/// MySQL-compatible INSERT function.
///
/// Syntax: INSERT(str, pos, len, newstr)
/// Returns str with the substring beginning at position pos and len characters long
/// replaced by newstr.
///
/// - pos is 1-based
/// - If pos is out of range, returns the original string
/// - If len is out of range, replaces from pos to end of string
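///
/// Illustrative examples (mirroring the unit tests below):
/// - `INSERT('Quadratic', 3, 4, 'What')` returns `'QuWhattic'`.
/// - `INSERT('Quadratic', 0, 4, 'What')` returns `'Quadratic'` (pos out of range).
/// - `INSERT('Quadratic', 3, 100, 'What')` returns `'QuWhat'` (len runs past the end).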
#[derive(Debug)]
pub struct InsertFunction {
signature: Signature,
}
impl InsertFunction {
pub fn register(registry: &FunctionRegistry) {
registry.register_scalar(InsertFunction::default());
}
}
impl Default for InsertFunction {
fn default() -> Self {
let mut signatures = Vec::new();
let string_types = [DataType::Utf8, DataType::LargeUtf8, DataType::Utf8View];
let int_types = [
DataType::Int64,
DataType::Int32,
DataType::Int16,
DataType::Int8,
DataType::UInt64,
DataType::UInt32,
DataType::UInt16,
DataType::UInt8,
];
for str_type in &string_types {
for newstr_type in &string_types {
for pos_type in &int_types {
for len_type in &int_types {
signatures.push(TypeSignature::Exact(vec![
str_type.clone(),
pos_type.clone(),
len_type.clone(),
newstr_type.clone(),
]));
}
}
}
}
Self {
signature: Signature::one_of(signatures, Volatility::Immutable),
}
}
}
impl fmt::Display for InsertFunction {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
write!(f, "{}", NAME.to_ascii_uppercase())
}
}
impl Function for InsertFunction {
fn name(&self) -> &str {
NAME
}
fn return_type(&self, _: &[DataType]) -> datafusion_common::Result<DataType> {
Ok(DataType::LargeUtf8)
}
fn signature(&self) -> &Signature {
&self.signature
}
fn invoke_with_args(
&self,
args: ScalarFunctionArgs,
) -> datafusion_common::Result<ColumnarValue> {
if args.args.len() != 4 {
return Err(DataFusionError::Execution(
"INSERT requires exactly 4 arguments: INSERT(str, pos, len, newstr)".to_string(),
));
}
let arrays = ColumnarValue::values_to_arrays(&args.args)?;
let len = arrays[0].len();
// Cast string arguments to LargeUtf8
let str_array = cast_to_large_utf8(&arrays[0], "str")?;
let newstr_array = cast_to_large_utf8(&arrays[3], "newstr")?;
let pos_array = cast_to_int64(&arrays[1], "pos")?;
let replace_len_array = cast_to_int64(&arrays[2], "len")?;
let str_arr = str_array.as_string::<i64>();
let pos_arr = pos_array.as_primitive::<datafusion_common::arrow::datatypes::Int64Type>();
let len_arr =
replace_len_array.as_primitive::<datafusion_common::arrow::datatypes::Int64Type>();
let newstr_arr = newstr_array.as_string::<i64>();
let mut builder = LargeStringBuilder::with_capacity(len, len * 32);
for i in 0..len {
// Check for NULLs
if str_arr.is_null(i)
|| pos_array.is_null(i)
|| replace_len_array.is_null(i)
|| newstr_arr.is_null(i)
{
builder.append_null();
continue;
}
let original = str_arr.value(i);
let pos = pos_arr.value(i);
let replace_len = len_arr.value(i);
let new_str = newstr_arr.value(i);
let result = insert_string(original, pos, replace_len, new_str);
builder.append_value(&result);
}
Ok(ColumnarValue::Array(Arc::new(builder.finish())))
}
}
/// Cast array to LargeUtf8 for uniform string access.
fn cast_to_large_utf8(array: &ArrayRef, name: &str) -> datafusion_common::Result<ArrayRef> {
cast(array.as_ref(), &DataType::LargeUtf8)
.map_err(|e| DataFusionError::Execution(format!("INSERT: {} cast failed: {}", name, e)))
}
fn cast_to_int64(array: &ArrayRef, name: &str) -> datafusion_common::Result<ArrayRef> {
cast(array.as_ref(), &DataType::Int64)
.map_err(|e| DataFusionError::Execution(format!("INSERT: {} cast failed: {}", name, e)))
}
/// Perform the INSERT string operation.
/// pos is 1-based. If pos < 1 or pos > len(str) + 1, returns original string.
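/// Illustrative: `insert_string("Quadratic", 3, 4, "What")` yields `"QuWhattic"`;
/// a negative `replace_len` is treated as 0, i.e. a pure insertion at `pos`.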
fn insert_string(original: &str, pos: i64, replace_len: i64, new_str: &str) -> String {
let char_count = original.chars().count();
// MySQL behavior: if pos < 1 or pos > string length + 1, return original
if pos < 1 || pos as usize > char_count + 1 {
return original.to_string();
}
let start_idx = (pos - 1) as usize; // Convert to 0-based
// Calculate end index for replacement
let replace_len = if replace_len < 0 {
0
} else {
replace_len as usize
};
let end_idx = (start_idx + replace_len).min(char_count);
let start_byte = char_to_byte_idx(original, start_idx);
let end_byte = char_to_byte_idx(original, end_idx);
let mut result = String::with_capacity(original.len() + new_str.len());
result.push_str(&original[..start_byte]);
result.push_str(new_str);
result.push_str(&original[end_byte..]);
result
}
fn char_to_byte_idx(s: &str, char_idx: usize) -> usize {
s.char_indices()
.nth(char_idx)
.map(|(idx, _)| idx)
.unwrap_or(s.len())
}
#[cfg(test)]
mod tests {
use std::sync::Arc;
use datafusion_common::arrow::array::{Int64Array, StringArray};
use datafusion_common::arrow::datatypes::Field;
use datafusion_expr::ScalarFunctionArgs;
use super::*;
fn create_args(arrays: Vec<ArrayRef>) -> ScalarFunctionArgs {
let arg_fields: Vec<_> = arrays
.iter()
.enumerate()
.map(|(i, arr)| {
Arc::new(Field::new(
format!("arg_{}", i),
arr.data_type().clone(),
true,
))
})
.collect();
ScalarFunctionArgs {
args: arrays.iter().cloned().map(ColumnarValue::Array).collect(),
arg_fields,
return_field: Arc::new(Field::new("result", DataType::LargeUtf8, true)),
number_rows: arrays[0].len(),
config_options: Arc::new(datafusion_common::config::ConfigOptions::default()),
}
}
#[test]
fn test_insert_basic() {
let function = InsertFunction::default();
// INSERT('Quadratic', 3, 4, 'What') => 'QuWhattic'
let str_arr = Arc::new(StringArray::from(vec!["Quadratic"]));
let pos = Arc::new(Int64Array::from(vec![3]));
let len = Arc::new(Int64Array::from(vec![4]));
let newstr = Arc::new(StringArray::from(vec!["What"]));
let args = create_args(vec![str_arr, pos, len, newstr]);
let result = function.invoke_with_args(args).unwrap();
if let ColumnarValue::Array(array) = result {
let str_array = array.as_string::<i64>();
assert_eq!(str_array.value(0), "QuWhattic");
} else {
panic!("Expected array result");
}
}
#[test]
fn test_insert_out_of_range_pos() {
let function = InsertFunction::default();
// INSERT('Quadratic', 0, 4, 'What') => 'Quadratic' (pos < 1)
let str_arr = Arc::new(StringArray::from(vec!["Quadratic", "Quadratic"]));
let pos = Arc::new(Int64Array::from(vec![0, 100]));
let len = Arc::new(Int64Array::from(vec![4, 4]));
let newstr = Arc::new(StringArray::from(vec!["What", "What"]));
let args = create_args(vec![str_arr, pos, len, newstr]);
let result = function.invoke_with_args(args).unwrap();
if let ColumnarValue::Array(array) = result {
let str_array = array.as_string::<i64>();
assert_eq!(str_array.value(0), "Quadratic"); // pos < 1
assert_eq!(str_array.value(1), "Quadratic"); // pos > length
} else {
panic!("Expected array result");
}
}
#[test]
fn test_insert_replace_to_end() {
let function = InsertFunction::default();
// INSERT('Quadratic', 3, 100, 'What') => 'QuWhat' (len exceeds remaining)
let str_arr = Arc::new(StringArray::from(vec!["Quadratic"]));
let pos = Arc::new(Int64Array::from(vec![3]));
let len = Arc::new(Int64Array::from(vec![100]));
let newstr = Arc::new(StringArray::from(vec!["What"]));
let args = create_args(vec![str_arr, pos, len, newstr]);
let result = function.invoke_with_args(args).unwrap();
if let ColumnarValue::Array(array) = result {
let str_array = array.as_string::<i64>();
assert_eq!(str_array.value(0), "QuWhat");
} else {
panic!("Expected array result");
}
}
#[test]
fn test_insert_unicode() {
let function = InsertFunction::default();
// INSERT('hello世界', 6, 1, 'の') => 'helloの界'
let str_arr = Arc::new(StringArray::from(vec!["hello世界"]));
let pos = Arc::new(Int64Array::from(vec![6]));
let len = Arc::new(Int64Array::from(vec![1]));
let newstr = Arc::new(StringArray::from(vec!["の"]));
let args = create_args(vec![str_arr, pos, len, newstr]);
let result = function.invoke_with_args(args).unwrap();
if let ColumnarValue::Array(array) = result {
let str_array = array.as_string::<i64>();
assert_eq!(str_array.value(0), "helloの界");
} else {
panic!("Expected array result");
}
}
#[test]
fn test_insert_with_nulls() {
let function = InsertFunction::default();
let str_arr = Arc::new(StringArray::from(vec![Some("hello"), None]));
let pos = Arc::new(Int64Array::from(vec![1, 1]));
let len = Arc::new(Int64Array::from(vec![1, 1]));
let newstr = Arc::new(StringArray::from(vec!["X", "X"]));
let args = create_args(vec![str_arr, pos, len, newstr]);
let result = function.invoke_with_args(args).unwrap();
if let ColumnarValue::Array(array) = result {
let str_array = array.as_string::<i64>();
assert_eq!(str_array.value(0), "Xello");
assert!(str_array.is_null(1));
} else {
panic!("Expected array result");
}
}
}

View File

@@ -0,0 +1,373 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//! MySQL-compatible LOCATE function implementation.
//!
//! LOCATE(substr, str) - Returns the position of the first occurrence of substr in str (1-based).
//! LOCATE(substr, str, pos) - Returns the position of the first occurrence of substr in str,
//! starting from position pos.
//! Returns 0 if substr is not found.
use std::fmt;
use std::sync::Arc;
use datafusion_common::DataFusionError;
use datafusion_common::arrow::array::{Array, ArrayRef, AsArray, Int64Builder};
use datafusion_common::arrow::compute::cast;
use datafusion_common::arrow::datatypes::DataType;
use datafusion_expr::{ColumnarValue, ScalarFunctionArgs, Signature, TypeSignature, Volatility};
use crate::function::Function;
use crate::function_registry::FunctionRegistry;
const NAME: &str = "locate";
/// MySQL-compatible LOCATE function.
///
/// Syntax:
/// - LOCATE(substr, str) - Returns 1-based position of substr in str, or 0 if not found.
/// - LOCATE(substr, str, pos) - Same, but starts searching from position pos.
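///
/// Illustrative examples (mirroring the unit tests below):
/// - `LOCATE('world', 'hello world')` returns 7.
/// - `LOCATE('o', 'hello world', 5)` returns 5.
/// - `LOCATE('xyz', 'hello world')` returns 0 (not found).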
#[derive(Debug)]
pub struct LocateFunction {
signature: Signature,
}
impl LocateFunction {
pub fn register(registry: &FunctionRegistry) {
registry.register_scalar(LocateFunction::default());
}
}
impl Default for LocateFunction {
fn default() -> Self {
// Support 2 or 3 arguments with various string types
let mut signatures = Vec::new();
let string_types = [DataType::Utf8, DataType::LargeUtf8, DataType::Utf8View];
let int_types = [
DataType::Int64,
DataType::Int32,
DataType::Int16,
DataType::Int8,
DataType::UInt64,
DataType::UInt32,
DataType::UInt16,
DataType::UInt8,
];
// 2-argument form: LOCATE(substr, str)
for substr_type in &string_types {
for str_type in &string_types {
signatures.push(TypeSignature::Exact(vec![
substr_type.clone(),
str_type.clone(),
]));
}
}
// 3-argument form: LOCATE(substr, str, pos)
for substr_type in &string_types {
for str_type in &string_types {
for pos_type in &int_types {
signatures.push(TypeSignature::Exact(vec![
substr_type.clone(),
str_type.clone(),
pos_type.clone(),
]));
}
}
}
Self {
signature: Signature::one_of(signatures, Volatility::Immutable),
}
}
}
impl fmt::Display for LocateFunction {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
write!(f, "{}", NAME.to_ascii_uppercase())
}
}
impl Function for LocateFunction {
fn name(&self) -> &str {
NAME
}
fn return_type(&self, _: &[DataType]) -> datafusion_common::Result<DataType> {
Ok(DataType::Int64)
}
fn signature(&self) -> &Signature {
&self.signature
}
fn invoke_with_args(
&self,
args: ScalarFunctionArgs,
) -> datafusion_common::Result<ColumnarValue> {
let arg_count = args.args.len();
if !(2..=3).contains(&arg_count) {
return Err(DataFusionError::Execution(
"LOCATE requires 2 or 3 arguments: LOCATE(substr, str) or LOCATE(substr, str, pos)"
.to_string(),
));
}
let arrays = ColumnarValue::values_to_arrays(&args.args)?;
// Cast string arguments to LargeUtf8 for uniform access
let substr_array = cast_to_large_utf8(&arrays[0], "substr")?;
let str_array = cast_to_large_utf8(&arrays[1], "str")?;
let substr = substr_array.as_string::<i64>();
let str_arr = str_array.as_string::<i64>();
let len = substr.len();
// Handle optional pos argument
let pos_array: Option<ArrayRef> = if arg_count == 3 {
Some(cast_to_int64(&arrays[2], "pos")?)
} else {
None
};
let mut builder = Int64Builder::with_capacity(len);
for i in 0..len {
if substr.is_null(i) || str_arr.is_null(i) {
builder.append_null();
continue;
}
let needle = substr.value(i);
let haystack = str_arr.value(i);
// Get starting position (1-based in MySQL, convert to 0-based)
let start_pos = if let Some(ref pos_arr) = pos_array {
if pos_arr.is_null(i) {
builder.append_null();
continue;
}
let pos = pos_arr
.as_primitive::<datafusion_common::arrow::datatypes::Int64Type>()
.value(i);
if pos < 1 {
// MySQL returns 0 for pos < 1
builder.append_value(0);
continue;
}
(pos - 1) as usize
} else {
0
};
// Find position using character-based indexing (for Unicode support)
let result = locate_substr(haystack, needle, start_pos);
builder.append_value(result);
}
Ok(ColumnarValue::Array(Arc::new(builder.finish())))
}
}
/// Cast array to LargeUtf8 for uniform string access.
fn cast_to_large_utf8(array: &ArrayRef, name: &str) -> datafusion_common::Result<ArrayRef> {
cast(array.as_ref(), &DataType::LargeUtf8)
.map_err(|e| DataFusionError::Execution(format!("LOCATE: {} cast failed: {}", name, e)))
}
fn cast_to_int64(array: &ArrayRef, name: &str) -> datafusion_common::Result<ArrayRef> {
cast(array.as_ref(), &DataType::Int64)
.map_err(|e| DataFusionError::Execution(format!("LOCATE: {} cast failed: {}", name, e)))
}
/// Find the 1-based position of needle in haystack, starting from start_pos (0-based character index).
/// Returns 0 if not found.
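/// Illustrative: `locate_substr("hello world", "o", 4)` returns 5 (the search starts at
/// 0-based character index 4, which is the first 'o'); an empty needle returns
/// `start_pos + 1` as long as `start_pos` is within the string.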
fn locate_substr(haystack: &str, needle: &str, start_pos: usize) -> i64 {
// Handle empty needle - MySQL returns start_pos + 1
if needle.is_empty() {
let char_count = haystack.chars().count();
return if start_pos <= char_count {
(start_pos + 1) as i64
} else {
0
};
}
// Convert start_pos (character index) to byte index
let byte_start = haystack
.char_indices()
.nth(start_pos)
.map(|(idx, _)| idx)
.unwrap_or(haystack.len());
if byte_start >= haystack.len() {
return 0;
}
// Search in the substring
let search_str = &haystack[byte_start..];
if let Some(byte_pos) = search_str.find(needle) {
// Convert byte position back to character position
let char_pos = search_str[..byte_pos].chars().count();
// Return 1-based position relative to original string
(start_pos + char_pos + 1) as i64
} else {
0
}
}
#[cfg(test)]
mod tests {
use std::sync::Arc;
use datafusion_common::arrow::array::StringArray;
use datafusion_common::arrow::datatypes::Field;
use datafusion_expr::ScalarFunctionArgs;
use super::*;
fn create_args(arrays: Vec<ArrayRef>) -> ScalarFunctionArgs {
let arg_fields: Vec<_> = arrays
.iter()
.enumerate()
.map(|(i, arr)| {
Arc::new(Field::new(
format!("arg_{}", i),
arr.data_type().clone(),
true,
))
})
.collect();
ScalarFunctionArgs {
args: arrays.iter().cloned().map(ColumnarValue::Array).collect(),
arg_fields,
return_field: Arc::new(Field::new("result", DataType::Int64, true)),
number_rows: arrays[0].len(),
config_options: Arc::new(datafusion_common::config::ConfigOptions::default()),
}
}
#[test]
fn test_locate_basic() {
let function = LocateFunction::default();
let substr = Arc::new(StringArray::from(vec!["world", "xyz", "hello"]));
let str_arr = Arc::new(StringArray::from(vec![
"hello world",
"hello world",
"hello world",
]));
let args = create_args(vec![substr, str_arr]);
let result = function.invoke_with_args(args).unwrap();
if let ColumnarValue::Array(array) = result {
let int_array = array.as_primitive::<datafusion_common::arrow::datatypes::Int64Type>();
assert_eq!(int_array.value(0), 7); // "world" at position 7
assert_eq!(int_array.value(1), 0); // "xyz" not found
assert_eq!(int_array.value(2), 1); // "hello" at position 1
} else {
panic!("Expected array result");
}
}
#[test]
fn test_locate_with_position() {
let function = LocateFunction::default();
let substr = Arc::new(StringArray::from(vec!["o", "o", "o"]));
let str_arr = Arc::new(StringArray::from(vec![
"hello world",
"hello world",
"hello world",
]));
let pos = Arc::new(datafusion_common::arrow::array::Int64Array::from(vec![
1, 5, 8,
]));
let args = create_args(vec![substr, str_arr, pos]);
let result = function.invoke_with_args(args).unwrap();
if let ColumnarValue::Array(array) = result {
let int_array = array.as_primitive::<datafusion_common::arrow::datatypes::Int64Type>();
assert_eq!(int_array.value(0), 5); // first 'o' at position 5
assert_eq!(int_array.value(1), 5); // 'o' at position 5 (start from 5)
assert_eq!(int_array.value(2), 8); // 'o' in "world" at position 8
} else {
panic!("Expected array result");
}
}
#[test]
fn test_locate_unicode() {
let function = LocateFunction::default();
let substr = Arc::new(StringArray::from(vec!["世", "界"]));
let str_arr = Arc::new(StringArray::from(vec!["hello世界", "hello世界"]));
let args = create_args(vec![substr, str_arr]);
let result = function.invoke_with_args(args).unwrap();
if let ColumnarValue::Array(array) = result {
let int_array = array.as_primitive::<datafusion_common::arrow::datatypes::Int64Type>();
assert_eq!(int_array.value(0), 6); // "世" at position 6
assert_eq!(int_array.value(1), 7); // "界" at position 7
} else {
panic!("Expected array result");
}
}
#[test]
fn test_locate_empty_needle() {
let function = LocateFunction::default();
let substr = Arc::new(StringArray::from(vec!["", ""]));
let str_arr = Arc::new(StringArray::from(vec!["hello", "hello"]));
let pos = Arc::new(datafusion_common::arrow::array::Int64Array::from(vec![
1, 3,
]));
let args = create_args(vec![substr, str_arr, pos]);
let result = function.invoke_with_args(args).unwrap();
if let ColumnarValue::Array(array) = result {
let int_array = array.as_primitive::<datafusion_common::arrow::datatypes::Int64Type>();
assert_eq!(int_array.value(0), 1); // empty string at pos 1
assert_eq!(int_array.value(1), 3); // empty string at pos 3
} else {
panic!("Expected array result");
}
}
#[test]
fn test_locate_with_nulls() {
let function = LocateFunction::default();
let substr = Arc::new(StringArray::from(vec![Some("o"), None]));
let str_arr = Arc::new(StringArray::from(vec![Some("hello"), Some("hello")]));
let args = create_args(vec![substr, str_arr]);
let result = function.invoke_with_args(args).unwrap();
if let ColumnarValue::Array(array) = result {
let int_array = array.as_primitive::<datafusion_common::arrow::datatypes::Int64Type>();
assert_eq!(int_array.value(0), 5);
assert!(int_array.is_null(1));
} else {
panic!("Expected array result");
}
}
}

View File

@@ -0,0 +1,252 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//! MySQL-compatible SPACE function implementation.
//!
//! SPACE(N) - Returns a string consisting of N space characters.
use std::fmt;
use std::sync::Arc;
use datafusion_common::DataFusionError;
use datafusion_common::arrow::array::{Array, AsArray, LargeStringBuilder};
use datafusion_common::arrow::datatypes::DataType;
use datafusion_expr::{ColumnarValue, ScalarFunctionArgs, Signature, TypeSignature, Volatility};
use crate::function::Function;
use crate::function_registry::FunctionRegistry;
const NAME: &str = "space";
// Safety limit for maximum number of spaces
const MAX_SPACE_COUNT: i64 = 1024 * 1024; // 1MB of spaces
/// MySQL-compatible SPACE function.
///
/// Syntax: SPACE(N)
/// Returns a string consisting of N space characters.
/// Returns NULL if N is NULL.
/// Returns empty string if N < 0.
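///
/// Illustrative examples (mirroring the unit tests below):
/// - `SPACE(3)` returns a string of three spaces.
/// - `SPACE(-1)` returns `''`.
/// - `SPACE(NULL)` returns NULL.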
#[derive(Debug)]
pub struct SpaceFunction {
signature: Signature,
}
impl SpaceFunction {
pub fn register(registry: &FunctionRegistry) {
registry.register_scalar(SpaceFunction::default());
}
}
impl Default for SpaceFunction {
fn default() -> Self {
Self {
signature: Signature::one_of(
vec![
TypeSignature::Exact(vec![DataType::Int64]),
TypeSignature::Exact(vec![DataType::Int32]),
TypeSignature::Exact(vec![DataType::Int16]),
TypeSignature::Exact(vec![DataType::Int8]),
TypeSignature::Exact(vec![DataType::UInt64]),
TypeSignature::Exact(vec![DataType::UInt32]),
TypeSignature::Exact(vec![DataType::UInt16]),
TypeSignature::Exact(vec![DataType::UInt8]),
],
Volatility::Immutable,
),
}
}
}
impl fmt::Display for SpaceFunction {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
write!(f, "{}", NAME.to_ascii_uppercase())
}
}
impl Function for SpaceFunction {
fn name(&self) -> &str {
NAME
}
fn return_type(&self, _: &[DataType]) -> datafusion_common::Result<DataType> {
Ok(DataType::LargeUtf8)
}
fn signature(&self) -> &Signature {
&self.signature
}
fn invoke_with_args(
&self,
args: ScalarFunctionArgs,
) -> datafusion_common::Result<ColumnarValue> {
if args.args.len() != 1 {
return Err(DataFusionError::Execution(
"SPACE requires exactly 1 argument: SPACE(N)".to_string(),
));
}
let arrays = ColumnarValue::values_to_arrays(&args.args)?;
let len = arrays[0].len();
let n_array = &arrays[0];
let mut builder = LargeStringBuilder::with_capacity(len, len * 10);
for i in 0..len {
if n_array.is_null(i) {
builder.append_null();
continue;
}
let n = get_int_value(n_array, i)?;
if n < 0 {
// MySQL returns empty string for negative values
builder.append_value("");
} else if n > MAX_SPACE_COUNT {
return Err(DataFusionError::Execution(format!(
"SPACE: requested {} spaces exceeds maximum allowed ({})",
n, MAX_SPACE_COUNT
)));
} else {
let spaces = " ".repeat(n as usize);
builder.append_value(&spaces);
}
}
Ok(ColumnarValue::Array(Arc::new(builder.finish())))
}
}
/// Extract integer value from various integer types.
fn get_int_value(
array: &datafusion_common::arrow::array::ArrayRef,
index: usize,
) -> datafusion_common::Result<i64> {
use datafusion_common::arrow::datatypes as arrow_types;
match array.data_type() {
DataType::Int64 => Ok(array.as_primitive::<arrow_types::Int64Type>().value(index)),
DataType::Int32 => Ok(array.as_primitive::<arrow_types::Int32Type>().value(index) as i64),
DataType::Int16 => Ok(array.as_primitive::<arrow_types::Int16Type>().value(index) as i64),
DataType::Int8 => Ok(array.as_primitive::<arrow_types::Int8Type>().value(index) as i64),
DataType::UInt64 => {
let v = array.as_primitive::<arrow_types::UInt64Type>().value(index);
if v > i64::MAX as u64 {
Err(DataFusionError::Execution(format!(
"SPACE: value {} exceeds maximum",
v
)))
} else {
Ok(v as i64)
}
}
DataType::UInt32 => Ok(array.as_primitive::<arrow_types::UInt32Type>().value(index) as i64),
DataType::UInt16 => Ok(array.as_primitive::<arrow_types::UInt16Type>().value(index) as i64),
DataType::UInt8 => Ok(array.as_primitive::<arrow_types::UInt8Type>().value(index) as i64),
_ => Err(DataFusionError::Execution(format!(
"SPACE: unsupported type {:?}",
array.data_type()
))),
}
}
#[cfg(test)]
mod tests {
use std::sync::Arc;
use datafusion_common::arrow::array::Int64Array;
use datafusion_common::arrow::datatypes::Field;
use datafusion_expr::ScalarFunctionArgs;
use super::*;
fn create_args(arrays: Vec<datafusion_common::arrow::array::ArrayRef>) -> ScalarFunctionArgs {
let arg_fields: Vec<_> = arrays
.iter()
.enumerate()
.map(|(i, arr)| {
Arc::new(Field::new(
format!("arg_{}", i),
arr.data_type().clone(),
true,
))
})
.collect();
ScalarFunctionArgs {
args: arrays.iter().cloned().map(ColumnarValue::Array).collect(),
arg_fields,
return_field: Arc::new(Field::new("result", DataType::LargeUtf8, true)),
number_rows: arrays[0].len(),
config_options: Arc::new(datafusion_common::config::ConfigOptions::default()),
}
}
#[test]
fn test_space_basic() {
let function = SpaceFunction::default();
let n = Arc::new(Int64Array::from(vec![0, 1, 5]));
let args = create_args(vec![n]);
let result = function.invoke_with_args(args).unwrap();
if let ColumnarValue::Array(array) = result {
let str_array = array.as_string::<i64>();
assert_eq!(str_array.value(0), "");
assert_eq!(str_array.value(1), " ");
assert_eq!(str_array.value(2), " ");
} else {
panic!("Expected array result");
}
}
#[test]
fn test_space_negative() {
let function = SpaceFunction::default();
let n = Arc::new(Int64Array::from(vec![-1, -100]));
let args = create_args(vec![n]);
let result = function.invoke_with_args(args).unwrap();
if let ColumnarValue::Array(array) = result {
let str_array = array.as_string::<i64>();
assert_eq!(str_array.value(0), "");
assert_eq!(str_array.value(1), "");
} else {
panic!("Expected array result");
}
}
#[test]
fn test_space_with_nulls() {
let function = SpaceFunction::default();
let n = Arc::new(Int64Array::from(vec![Some(3), None]));
let args = create_args(vec![n]);
let result = function.invoke_with_args(args).unwrap();
if let ColumnarValue::Array(array) = result {
let str_array = array.as_string::<i64>();
assert_eq!(str_array.value(0), " ");
assert!(str_array.is_null(1));
} else {
panic!("Expected array result");
}
}
}

View File

@@ -15,9 +15,14 @@
use std::{fmt, mem};
use common_telemetry::debug;
use snafu::ensure;
use tokio::sync::{OwnedSemaphorePermit, TryAcquireError};
use crate::error::{
MemoryAcquireTimeoutSnafu, MemoryLimitExceededSnafu, MemorySemaphoreClosedSnafu, Result,
};
use crate::manager::{MemoryMetrics, MemoryQuota};
use crate::policy::OnExhaustedPolicy;
/// Guard representing a slice of reserved memory.
pub struct MemoryGuard<M: MemoryMetrics> {
@@ -55,11 +60,52 @@ impl<M: MemoryMetrics> MemoryGuard<M> {
}
}
/// Tries to allocate additional memory during task execution.
/// Acquires additional memory, waiting if necessary until enough is available.
///
/// On success, merges the new memory into this guard.
///
/// # Errors
/// - Returns error if requested bytes would exceed the manager's total limit
/// - Returns error if the semaphore is unexpectedly closed
pub async fn acquire_additional(&mut self, bytes: u64) -> Result<()> {
match &mut self.state {
GuardState::Unlimited => Ok(()),
GuardState::Limited { permit, quota } => {
if bytes == 0 {
return Ok(());
}
let additional_permits = quota.bytes_to_permits(bytes);
let current_permits = permit.num_permits() as u32;
ensure!(
current_permits.saturating_add(additional_permits) <= quota.limit_permits,
MemoryLimitExceededSnafu {
requested_bytes: bytes,
limit_bytes: quota.permits_to_bytes(quota.limit_permits)
}
);
let additional_permit = quota
.semaphore
.clone()
.acquire_many_owned(additional_permits)
.await
.map_err(|_| MemorySemaphoreClosedSnafu.build())?;
permit.merge(additional_permit);
quota.update_in_use_metric();
debug!("Acquired additional {} bytes", bytes);
Ok(())
}
}
}
/// Tries to acquire additional memory without waiting.
///
/// On success, merges the new memory into this guard and returns true.
/// On failure, returns false and leaves this guard unchanged.
pub fn request_additional(&mut self, bytes: u64) -> bool {
pub fn try_acquire_additional(&mut self, bytes: u64) -> bool {
match &mut self.state {
GuardState::Unlimited => true,
GuardState::Limited { permit, quota } => {
@@ -77,11 +123,11 @@ impl<M: MemoryMetrics> MemoryGuard<M> {
Ok(additional_permit) => {
permit.merge(additional_permit);
quota.update_in_use_metric();
debug!("Allocated additional {} bytes", bytes);
debug!("Acquired additional {} bytes", bytes);
true
}
Err(TryAcquireError::NoPermits) | Err(TryAcquireError::Closed) => {
quota.metrics.inc_rejected("request_additional");
quota.metrics.inc_rejected("try_acquire_additional");
false
}
}
@@ -89,11 +135,55 @@ impl<M: MemoryMetrics> MemoryGuard<M> {
}
}
/// Releases a portion of granted memory back to the pool early,
/// before the guard is dropped.
/// Acquires additional memory based on the given policy.
///
/// - For `OnExhaustedPolicy::Wait`: Waits up to the timeout duration for memory to become available
/// - For `OnExhaustedPolicy::Fail`: Returns immediately if memory is not available
///
/// # Errors
/// - `MemoryLimitExceeded`: Requested bytes would exceed the total limit (both policies), or memory is currently exhausted (Fail policy only)
/// - `MemoryAcquireTimeout`: Timeout elapsed while waiting for memory (Wait policy only)
/// - `MemorySemaphoreClosed`: The internal semaphore is unexpectedly closed (rare, indicates system issue)
pub async fn acquire_additional_with_policy(
&mut self,
bytes: u64,
policy: OnExhaustedPolicy,
) -> Result<()> {
match policy {
OnExhaustedPolicy::Wait { timeout } => {
match tokio::time::timeout(timeout, self.acquire_additional(bytes)).await {
Ok(Ok(())) => Ok(()),
Ok(Err(e)) => Err(e),
Err(_elapsed) => MemoryAcquireTimeoutSnafu {
requested_bytes: bytes,
waited: timeout,
}
.fail(),
}
}
OnExhaustedPolicy::Fail => {
if self.try_acquire_additional(bytes) {
Ok(())
} else {
MemoryLimitExceededSnafu {
requested_bytes: bytes,
limit_bytes: match &self.state {
GuardState::Unlimited => 0, // unreachable: unlimited mode always succeeds
GuardState::Limited { quota, .. } => {
quota.permits_to_bytes(quota.limit_permits)
}
},
}
.fail()
}
}
}
}
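// Illustrative usage sketch: a caller holding a guard that needs more memory mid-task
// could combine the policy-based API with partial release roughly as follows
// (assumes `guard`, `extra_bytes`, and `policy` are already in scope):
//
//     guard.acquire_additional_with_policy(extra_bytes, policy).await?;
//     // ... use the extra memory ...
//     let _ = guard.release_partial(extra_bytes);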
/// Releases a portion of granted memory back to the pool before the guard is dropped.
///
/// Returns true if the release succeeds or is a no-op; false if the request exceeds granted.
pub fn early_release_partial(&mut self, bytes: u64) -> bool {
pub fn release_partial(&mut self, bytes: u64) -> bool {
match &mut self.state {
GuardState::Unlimited => true,
GuardState::Limited { permit, quota } => {
@@ -109,7 +199,7 @@ impl<M: MemoryMetrics> MemoryGuard<M> {
quota.permits_to_bytes(released_permit.num_permits() as u32);
drop(released_permit);
quota.update_in_use_metric();
debug!("Early released {} bytes from memory guard", released_bytes);
debug!("Released {} bytes from memory guard", released_bytes);
true
}
None => false,

View File

@@ -37,6 +37,12 @@ pub struct MemoryManager<M: MemoryMetrics> {
quota: Option<MemoryQuota<M>>,
}
impl<M: MemoryMetrics + Default> Default for MemoryManager<M> {
fn default() -> Self {
Self::new(0, M::default())
}
}
#[derive(Clone)]
pub(crate) struct MemoryQuota<M: MemoryMetrics> {
pub(crate) semaphore: Arc<Semaphore>,

View File

@@ -83,7 +83,7 @@ fn test_request_additional_success() {
assert_eq!(manager.used_bytes(), base);
// Request additional memory (3MB) - should succeed and merge
assert!(guard.request_additional(3 * PERMIT_GRANULARITY_BYTES));
assert!(guard.try_acquire_additional(3 * PERMIT_GRANULARITY_BYTES));
assert_eq!(guard.granted_bytes(), 8 * PERMIT_GRANULARITY_BYTES);
assert_eq!(manager.used_bytes(), 8 * PERMIT_GRANULARITY_BYTES);
}
@@ -98,11 +98,11 @@ fn test_request_additional_exceeds_limit() {
let mut guard = manager.try_acquire(base).unwrap();
// Request additional memory (3MB) - should succeed
assert!(guard.request_additional(3 * PERMIT_GRANULARITY_BYTES));
assert!(guard.try_acquire_additional(3 * PERMIT_GRANULARITY_BYTES));
assert_eq!(manager.used_bytes(), 8 * PERMIT_GRANULARITY_BYTES);
// Request more (3MB) - should fail (would exceed 10MB limit)
let result = guard.request_additional(3 * PERMIT_GRANULARITY_BYTES);
let result = guard.try_acquire_additional(3 * PERMIT_GRANULARITY_BYTES);
assert!(!result);
// Still at 8MB
@@ -119,7 +119,7 @@ fn test_request_additional_auto_release_on_guard_drop() {
let mut guard = manager.try_acquire(5 * PERMIT_GRANULARITY_BYTES).unwrap();
// Request additional - memory is merged into guard
assert!(guard.request_additional(3 * PERMIT_GRANULARITY_BYTES));
assert!(guard.try_acquire_additional(3 * PERMIT_GRANULARITY_BYTES));
assert_eq!(manager.used_bytes(), 8 * PERMIT_GRANULARITY_BYTES);
// When guard drops, all memory (base + additional) is released together
@@ -135,7 +135,7 @@ fn test_request_additional_unlimited() {
let mut guard = manager.try_acquire(5 * PERMIT_GRANULARITY_BYTES).unwrap();
// Should always succeed with unlimited manager
assert!(guard.request_additional(100 * PERMIT_GRANULARITY_BYTES));
assert!(guard.try_acquire_additional(100 * PERMIT_GRANULARITY_BYTES));
assert_eq!(guard.granted_bytes(), 0);
assert_eq!(manager.used_bytes(), 0);
}
@@ -148,7 +148,7 @@ fn test_request_additional_zero_bytes() {
let mut guard = manager.try_acquire(5 * PERMIT_GRANULARITY_BYTES).unwrap();
// Request 0 bytes should succeed without affecting anything
assert!(guard.request_additional(0));
assert!(guard.try_acquire_additional(0));
assert_eq!(guard.granted_bytes(), 5 * PERMIT_GRANULARITY_BYTES);
assert_eq!(manager.used_bytes(), 5 * PERMIT_GRANULARITY_BYTES);
}
@@ -162,7 +162,7 @@ fn test_early_release_partial_success() {
assert_eq!(manager.used_bytes(), 8 * PERMIT_GRANULARITY_BYTES);
// Release half
assert!(guard.early_release_partial(4 * PERMIT_GRANULARITY_BYTES));
assert!(guard.release_partial(4 * PERMIT_GRANULARITY_BYTES));
assert_eq!(guard.granted_bytes(), 4 * PERMIT_GRANULARITY_BYTES);
assert_eq!(manager.used_bytes(), 4 * PERMIT_GRANULARITY_BYTES);
@@ -177,7 +177,7 @@ fn test_early_release_partial_exceeds_granted() {
let mut guard = manager.try_acquire(5 * PERMIT_GRANULARITY_BYTES).unwrap();
// Try to release more than granted - should fail
assert!(!guard.early_release_partial(10 * PERMIT_GRANULARITY_BYTES));
assert!(!guard.release_partial(10 * PERMIT_GRANULARITY_BYTES));
assert_eq!(guard.granted_bytes(), 5 * PERMIT_GRANULARITY_BYTES);
assert_eq!(manager.used_bytes(), 5 * PERMIT_GRANULARITY_BYTES);
}
@@ -188,7 +188,7 @@ fn test_early_release_partial_unlimited() {
let mut guard = manager.try_acquire(100 * PERMIT_GRANULARITY_BYTES).unwrap();
// Unlimited guard - release should succeed (no-op)
assert!(guard.early_release_partial(50 * PERMIT_GRANULARITY_BYTES));
assert!(guard.release_partial(50 * PERMIT_GRANULARITY_BYTES));
assert_eq!(guard.granted_bytes(), 0);
}
@@ -200,22 +200,22 @@ fn test_request_and_early_release_symmetry() {
let mut guard = manager.try_acquire(5 * PERMIT_GRANULARITY_BYTES).unwrap();
// Request additional
assert!(guard.request_additional(5 * PERMIT_GRANULARITY_BYTES));
assert!(guard.try_acquire_additional(5 * PERMIT_GRANULARITY_BYTES));
assert_eq!(guard.granted_bytes(), 10 * PERMIT_GRANULARITY_BYTES);
assert_eq!(manager.used_bytes(), 10 * PERMIT_GRANULARITY_BYTES);
// Early release some
assert!(guard.early_release_partial(3 * PERMIT_GRANULARITY_BYTES));
assert!(guard.release_partial(3 * PERMIT_GRANULARITY_BYTES));
assert_eq!(guard.granted_bytes(), 7 * PERMIT_GRANULARITY_BYTES);
assert_eq!(manager.used_bytes(), 7 * PERMIT_GRANULARITY_BYTES);
// Request again
assert!(guard.request_additional(2 * PERMIT_GRANULARITY_BYTES));
assert!(guard.try_acquire_additional(2 * PERMIT_GRANULARITY_BYTES));
assert_eq!(guard.granted_bytes(), 9 * PERMIT_GRANULARITY_BYTES);
assert_eq!(manager.used_bytes(), 9 * PERMIT_GRANULARITY_BYTES);
// Early release again
assert!(guard.early_release_partial(4 * PERMIT_GRANULARITY_BYTES));
assert!(guard.release_partial(4 * PERMIT_GRANULARITY_BYTES));
assert_eq!(guard.granted_bytes(), 5 * PERMIT_GRANULARITY_BYTES);
assert_eq!(manager.used_bytes(), 5 * PERMIT_GRANULARITY_BYTES);
@@ -226,25 +226,186 @@ fn test_request_and_early_release_symmetry() {
#[test]
fn test_small_allocation_rounds_up() {
// Test that allocations smaller than PERMIT_GRANULARITY_BYTES
// round up to 1 permit and can use request_additional()
// round up to 1 permit and can use try_acquire_additional()
let limit = 10 * PERMIT_GRANULARITY_BYTES;
let manager = MemoryManager::new(limit, NoOpMetrics);
let mut guard = manager.try_acquire(512 * 1024).unwrap(); // 512KB
assert_eq!(guard.granted_bytes(), PERMIT_GRANULARITY_BYTES); // Rounds up to 1MB
assert!(guard.request_additional(2 * PERMIT_GRANULARITY_BYTES)); // Can request more
assert!(guard.try_acquire_additional(2 * PERMIT_GRANULARITY_BYTES)); // Can request more
assert_eq!(guard.granted_bytes(), 3 * PERMIT_GRANULARITY_BYTES);
}
#[test]
fn test_acquire_zero_bytes_lazy_allocation() {
// Test that acquire(0) returns 0 permits but can request_additional() later
// Test that acquire(0) returns 0 permits but can try_acquire_additional() later
let manager = MemoryManager::new(10 * PERMIT_GRANULARITY_BYTES, NoOpMetrics);
let mut guard = manager.try_acquire(0).unwrap();
assert_eq!(guard.granted_bytes(), 0); // No permits consumed
assert_eq!(manager.used_bytes(), 0);
assert!(guard.request_additional(3 * PERMIT_GRANULARITY_BYTES)); // Lazy allocation
assert!(guard.try_acquire_additional(3 * PERMIT_GRANULARITY_BYTES)); // Lazy allocation
assert_eq!(guard.granted_bytes(), 3 * PERMIT_GRANULARITY_BYTES);
}
#[tokio::test(flavor = "current_thread")]
async fn test_acquire_additional_blocks_and_unblocks() {
let limit = 10 * PERMIT_GRANULARITY_BYTES;
let manager = MemoryManager::new(limit, NoOpMetrics);
// First guard takes 9MB, leaving only 1MB available
let mut guard1 = manager.try_acquire(9 * PERMIT_GRANULARITY_BYTES).unwrap();
assert_eq!(manager.used_bytes(), 9 * PERMIT_GRANULARITY_BYTES);
// Spawn a task that will block trying to acquire additional 5MB (needs total 10MB available)
let manager_clone = manager.clone();
let waiter = tokio::spawn(async move {
let mut guard2 = manager_clone.try_acquire(0).unwrap();
// This will block until enough memory is available
guard2
.acquire_additional(5 * PERMIT_GRANULARITY_BYTES)
.await
.unwrap();
guard2
});
sleep(Duration::from_millis(10)).await;
// Release 5MB from guard1 - this should unblock the waiter
assert!(guard1.release_partial(5 * PERMIT_GRANULARITY_BYTES));
// Waiter should complete now
let guard2 = waiter.await.unwrap();
assert_eq!(guard2.granted_bytes(), 5 * PERMIT_GRANULARITY_BYTES);
// Total: guard1 has 4MB, guard2 has 5MB = 9MB
assert_eq!(manager.used_bytes(), 9 * PERMIT_GRANULARITY_BYTES);
}
#[tokio::test(flavor = "current_thread")]
async fn test_acquire_additional_exceeds_total_limit() {
let limit = 10 * PERMIT_GRANULARITY_BYTES;
let manager = MemoryManager::new(limit, NoOpMetrics);
let mut guard = manager.try_acquire(8 * PERMIT_GRANULARITY_BYTES).unwrap();
// Try to acquire additional 5MB - would exceed total limit of 10MB
let result = guard.acquire_additional(5 * PERMIT_GRANULARITY_BYTES).await;
assert!(result.is_err());
// Guard should remain unchanged
assert_eq!(guard.granted_bytes(), 8 * PERMIT_GRANULARITY_BYTES);
assert_eq!(manager.used_bytes(), 8 * PERMIT_GRANULARITY_BYTES);
}
#[tokio::test(flavor = "current_thread")]
async fn test_acquire_additional_success() {
let limit = 10 * PERMIT_GRANULARITY_BYTES;
let manager = MemoryManager::new(limit, NoOpMetrics);
let mut guard = manager.try_acquire(3 * PERMIT_GRANULARITY_BYTES).unwrap();
assert_eq!(manager.used_bytes(), 3 * PERMIT_GRANULARITY_BYTES);
// Acquire additional 4MB - should succeed
guard
.acquire_additional(4 * PERMIT_GRANULARITY_BYTES)
.await
.unwrap();
assert_eq!(guard.granted_bytes(), 7 * PERMIT_GRANULARITY_BYTES);
assert_eq!(manager.used_bytes(), 7 * PERMIT_GRANULARITY_BYTES);
}
#[tokio::test(flavor = "current_thread")]
async fn test_acquire_additional_with_policy_wait_success() {
use crate::policy::OnExhaustedPolicy;
let limit = 10 * PERMIT_GRANULARITY_BYTES;
let manager = MemoryManager::new(limit, NoOpMetrics);
let mut guard1 = manager.try_acquire(8 * PERMIT_GRANULARITY_BYTES).unwrap();
let manager_clone = manager.clone();
let waiter = tokio::spawn(async move {
let mut guard2 = manager_clone.try_acquire(0).unwrap();
// Wait policy with 1 second timeout
guard2
.acquire_additional_with_policy(
5 * PERMIT_GRANULARITY_BYTES,
OnExhaustedPolicy::Wait {
timeout: Duration::from_secs(1),
},
)
.await
.unwrap();
guard2
});
sleep(Duration::from_millis(10)).await;
// Release memory to unblock waiter
assert!(guard1.release_partial(5 * PERMIT_GRANULARITY_BYTES));
let guard2 = waiter.await.unwrap();
assert_eq!(guard2.granted_bytes(), 5 * PERMIT_GRANULARITY_BYTES);
}
#[tokio::test(flavor = "current_thread")]
async fn test_acquire_additional_with_policy_wait_timeout() {
use crate::policy::OnExhaustedPolicy;
let limit = 10 * PERMIT_GRANULARITY_BYTES;
let manager = MemoryManager::new(limit, NoOpMetrics);
// Take all memory
let _guard1 = manager.try_acquire(10 * PERMIT_GRANULARITY_BYTES).unwrap();
let mut guard2 = manager.try_acquire(0).unwrap();
// Try to acquire with short timeout - should timeout
let result = guard2
.acquire_additional_with_policy(
5 * PERMIT_GRANULARITY_BYTES,
OnExhaustedPolicy::Wait {
timeout: Duration::from_millis(50),
},
)
.await;
assert!(result.is_err());
assert_eq!(guard2.granted_bytes(), 0);
}
#[tokio::test(flavor = "current_thread")]
async fn test_acquire_additional_with_policy_fail() {
use crate::policy::OnExhaustedPolicy;
let limit = 10 * PERMIT_GRANULARITY_BYTES;
let manager = MemoryManager::new(limit, NoOpMetrics);
let _guard1 = manager.try_acquire(8 * PERMIT_GRANULARITY_BYTES).unwrap();
let mut guard2 = manager.try_acquire(0).unwrap();
// Fail policy - should return error immediately
let result = guard2
.acquire_additional_with_policy(5 * PERMIT_GRANULARITY_BYTES, OnExhaustedPolicy::Fail)
.await;
assert!(result.is_err());
assert_eq!(guard2.granted_bytes(), 0);
}
#[tokio::test(flavor = "current_thread")]
async fn test_acquire_additional_unlimited() {
let manager = MemoryManager::new(0, NoOpMetrics); // Unlimited
let mut guard = manager.try_acquire(0).unwrap();
// Should always succeed with unlimited manager
guard
.acquire_additional(1000 * PERMIT_GRANULARITY_BYTES)
.await
.unwrap();
assert_eq!(guard.granted_bytes(), 0);
assert_eq!(manager.used_bytes(), 0);
}
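
Editor's note: taken together, these hunks rename request_additional to try_acquire_additional and early_release_partial to release_partial, and add awaitable acquire_additional / acquire_additional_with_policy variants driven by OnExhaustedPolicy. As a rough mental model only, not the crate's actual implementation, the budget behaves like a tokio semaphore in which one permit stands for PERMIT_GRANULARITY_BYTES; the self-contained sketch below illustrates the try-acquire, fail-fast, and release behaviour under that assumption.

use std::sync::Arc;
use tokio::sync::Semaphore;

const PERMIT_GRANULARITY_BYTES: usize = 1024 * 1024;

// Round byte counts up to whole permits so sub-granularity requests still hold one permit.
fn bytes_to_permits(bytes: usize) -> u32 {
    bytes.div_ceil(PERMIT_GRANULARITY_BYTES) as u32
}

fn main() {
    // A 10 MiB budget expressed as 10 permits.
    let semaphore = Arc::new(Semaphore::new(10));

    // try_acquire-style call: take 5 MiB worth of permits without waiting.
    let base = semaphore
        .clone()
        .try_acquire_many_owned(bytes_to_permits(5 * PERMIT_GRANULARITY_BYTES))
        .expect("5 MiB fits into the 10 MiB budget");

    // try_acquire_additional-style call: ask for 3 more MiB, failing fast instead of blocking.
    let additional = semaphore
        .clone()
        .try_acquire_many_owned(bytes_to_permits(3 * PERMIT_GRANULARITY_BYTES));
    assert!(additional.is_ok());

    // A further 3 MiB would exceed the 10 MiB budget, so the fail-fast path rejects it.
    assert!(
        semaphore
            .clone()
            .try_acquire_many_owned(bytes_to_permits(3 * PERMIT_GRANULARITY_BYTES))
            .is_err()
    );

    // Dropping permits is the release_partial / guard-drop path: the budget is reusable again.
    drop(additional);
    drop(base);
    assert_eq!(semaphore.available_permits(), 10);
}

Rounding up means even a 512 KiB request holds a full permit, which matches the test_small_allocation_rounds_up case above.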

View File

@@ -514,6 +514,22 @@ impl Display for GcRegionsReply {
}
}
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
pub struct EnterStagingRegion {
pub region_id: RegionId,
pub partition_expr: String,
}
impl Display for EnterStagingRegion {
fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
write!(
f,
"EnterStagingRegion(region_id={}, partition_expr={})",
self.region_id, self.partition_expr
)
}
}
#[derive(Debug, Clone, Serialize, Deserialize, Display, PartialEq)]
pub enum Instruction {
/// Opens regions.
@@ -541,6 +557,8 @@ pub enum Instruction {
GcRegions(GcRegions),
/// Temporary suspend serving reads or writes
Suspend,
/// Makes regions enter staging state.
EnterStagingRegions(Vec<EnterStagingRegion>),
}
impl Instruction {
@@ -597,6 +615,13 @@ impl Instruction {
_ => None,
}
}
pub fn into_enter_staging_regions(self) -> Option<Vec<EnterStagingRegion>> {
match self {
Self::EnterStagingRegions(enter_staging) => Some(enter_staging),
_ => None,
}
}
}
/// The reply of [UpgradeRegion].
@@ -690,6 +715,28 @@ where
})
}
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone)]
pub struct EnterStagingRegionReply {
pub region_id: RegionId,
/// Returns true if the region is under the new region rule.
pub ready: bool,
/// Indicates whether the region exists.
pub exists: bool,
/// Return error if any during the operation.
pub error: Option<String>,
}
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone)]
pub struct EnterStagingRegionsReply {
pub replies: Vec<EnterStagingRegionReply>,
}
impl EnterStagingRegionsReply {
pub fn new(replies: Vec<EnterStagingRegionReply>) -> Self {
Self { replies }
}
}
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone)]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum InstructionReply {
@@ -710,6 +757,7 @@ pub enum InstructionReply {
FlushRegions(FlushRegionReply),
GetFileRefs(GetFileRefsReply),
GcRegions(GcRegionsReply),
EnterStagingRegions(EnterStagingRegionsReply),
}
impl Display for InstructionReply {
@@ -726,6 +774,13 @@ impl Display for InstructionReply {
Self::FlushRegions(reply) => write!(f, "InstructionReply::FlushRegions({})", reply),
Self::GetFileRefs(reply) => write!(f, "InstructionReply::GetFileRefs({})", reply),
Self::GcRegions(reply) => write!(f, "InstructionReply::GcRegion({})", reply),
Self::EnterStagingRegions(reply) => {
write!(
f,
"InstructionReply::EnterStagingRegions({:?})",
reply.replies
)
}
}
}
}
@@ -766,13 +821,20 @@ impl InstructionReply {
_ => panic!("Expected FlushRegions reply"),
}
}
pub fn expect_enter_staging_regions_reply(self) -> Vec<EnterStagingRegionReply> {
match self {
Self::EnterStagingRegions(reply) => reply.replies,
_ => panic!("Expected EnterStagingRegion reply"),
}
}
}
#[cfg(test)]
mod tests {
use std::collections::HashSet;
use store_api::storage::FileId;
use store_api::storage::{FileId, FileRef};
use super::*;
@@ -1147,12 +1209,14 @@ mod tests {
let mut manifest = FileRefsManifest::default();
let r0 = RegionId::new(1024, 1);
let r1 = RegionId::new(1024, 2);
manifest
.file_refs
.insert(r0, HashSet::from([FileId::random()]));
manifest
.file_refs
.insert(r1, HashSet::from([FileId::random()]));
manifest.file_refs.insert(
r0,
HashSet::from([FileRef::new(r0, FileId::random(), None)]),
);
manifest.file_refs.insert(
r1,
HashSet::from([FileRef::new(r1, FileId::random(), None)]),
);
manifest.manifest_version.insert(r0, 10);
manifest.manifest_version.insert(r1, 20);
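
Editor's note: the reply side of the new instruction plugs into an InstructionReply enum tagged with #[serde(tag = "type", rename_all = "snake_case")]. The stand-alone sketch below shows how an internally tagged variant of that shape serializes; the u64 region id is a placeholder for the real RegionId newtype, and the field set simply mirrors EnterStagingRegionReply as declared above.

use serde::{Deserialize, Serialize};

#[derive(Debug, Serialize, Deserialize, PartialEq)]
struct EnterStagingRegionReply {
    // Stand-in for the real RegionId newtype.
    region_id: u64,
    ready: bool,
    exists: bool,
    error: Option<String>,
}

#[derive(Debug, Serialize, Deserialize, PartialEq)]
#[serde(tag = "type", rename_all = "snake_case")]
enum InstructionReply {
    EnterStagingRegions { replies: Vec<EnterStagingRegionReply> },
}

fn main() {
    let reply = InstructionReply::EnterStagingRegions {
        replies: vec![EnterStagingRegionReply {
            region_id: 1,
            ready: true,
            exists: true,
            error: None,
        }],
    };
    // The tag attribute folds the variant name into the object itself.
    let json = serde_json::to_string(&reply).unwrap();
    assert!(json.contains(r#""type":"enter_staging_regions""#));
    println!("{json}");
}

With internal tagging the variant name travels inside the object as the type field rather than as an outer wrapper.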

View File

@@ -848,7 +848,7 @@ impl PgStore {
.context(CreatePostgresPoolSnafu)?,
};
Self::with_pg_pool(pool, None, table_name, max_txn_ops).await
Self::with_pg_pool(pool, None, table_name, max_txn_ops, false).await
}
/// Create [PgStore] impl of [KvBackendRef] from url (backward compatibility).
@@ -862,6 +862,7 @@ impl PgStore {
schema_name: Option<&str>,
table_name: &str,
max_txn_ops: usize,
auto_create_schema: bool,
) -> Result<KvBackendRef> {
// Ensure the postgres metadata backend is ready to use.
let client = match pool.get().await {
@@ -873,9 +874,23 @@ impl PgStore {
.fail();
}
};
// Automatically create schema if enabled and schema_name is provided.
if auto_create_schema
&& let Some(schema) = schema_name
&& !schema.is_empty()
{
let create_schema_sql = format!("CREATE SCHEMA IF NOT EXISTS \"{}\"", schema);
client
.execute(&create_schema_sql, &[])
.await
.with_context(|_| PostgresExecutionSnafu {
sql: create_schema_sql.clone(),
})?;
}
let template_factory = PgSqlTemplateFactory::new(schema_name, table_name);
let sql_template_set = template_factory.build();
// Do not attempt to create schema implicitly.
client
.execute(&sql_template_set.create_table_statement, &[])
.await
@@ -959,7 +974,7 @@ mod tests {
let Some(pool) = build_pg15_pool().await else {
return;
};
let res = PgStore::with_pg_pool(pool, None, "pg15_public_should_fail", 128).await;
let res = PgStore::with_pg_pool(pool, None, "pg15_public_should_fail", 128, false).await;
assert!(
res.is_err(),
"creating table in public should fail for test_user"
@@ -1214,4 +1229,249 @@ mod tests {
let t = PgSqlTemplateFactory::format_table_ident(Some(""), "test_table");
assert_eq!(t, "\"test_table\"");
}
#[tokio::test]
async fn test_auto_create_schema_enabled() {
common_telemetry::init_default_ut_logging();
maybe_skip_postgres_integration_test!();
let endpoints = std::env::var("GT_POSTGRES_ENDPOINTS").unwrap();
let mut cfg = Config::new();
cfg.url = Some(endpoints);
let pool = cfg
.create_pool(Some(Runtime::Tokio1), NoTls)
.context(CreatePostgresPoolSnafu)
.unwrap();
let schema_name = "test_auto_create_enabled";
let table_name = "test_table";
// Drop the schema if it exists to start clean
let client = pool.get().await.unwrap();
let _ = client
.execute(
&format!("DROP SCHEMA IF EXISTS \"{}\" CASCADE", schema_name),
&[],
)
.await;
// Create store with auto_create_schema enabled
let _ = PgStore::with_pg_pool(pool.clone(), Some(schema_name), table_name, 128, true)
.await
.unwrap();
// Verify schema was created
let row = client
.query_one(
"SELECT schema_name FROM information_schema.schemata WHERE schema_name = $1",
&[&schema_name],
)
.await
.unwrap();
let created_schema: String = row.get(0);
assert_eq!(created_schema, schema_name);
// Verify table was created in the schema
let row = client
.query_one(
"SELECT table_schema, table_name FROM information_schema.tables WHERE table_schema = $1 AND table_name = $2",
&[&schema_name, &table_name],
)
.await
.unwrap();
let created_table_schema: String = row.get(0);
let created_table_name: String = row.get(1);
assert_eq!(created_table_schema, schema_name);
assert_eq!(created_table_name, table_name);
// Cleanup
let _ = client
.execute(
&format!("DROP SCHEMA IF EXISTS \"{}\" CASCADE", schema_name),
&[],
)
.await;
}
#[tokio::test]
async fn test_auto_create_schema_disabled() {
common_telemetry::init_default_ut_logging();
maybe_skip_postgres_integration_test!();
let endpoints = std::env::var("GT_POSTGRES_ENDPOINTS").unwrap();
let mut cfg = Config::new();
cfg.url = Some(endpoints);
let pool = cfg
.create_pool(Some(Runtime::Tokio1), NoTls)
.context(CreatePostgresPoolSnafu)
.unwrap();
let schema_name = "test_auto_create_disabled";
let table_name = "test_table";
// Drop the schema if it exists to start clean
let client = pool.get().await.unwrap();
let _ = client
.execute(
&format!("DROP SCHEMA IF EXISTS \"{}\" CASCADE", schema_name),
&[],
)
.await;
// Try to create store with auto_create_schema disabled (should fail)
let result =
PgStore::with_pg_pool(pool.clone(), Some(schema_name), table_name, 128, false).await;
// Verify it failed because schema doesn't exist
assert!(
result.is_err(),
"Expected error when schema doesn't exist and auto_create_schema is disabled"
);
}
#[tokio::test]
async fn test_auto_create_schema_already_exists() {
common_telemetry::init_default_ut_logging();
maybe_skip_postgres_integration_test!();
let endpoints = std::env::var("GT_POSTGRES_ENDPOINTS").unwrap();
let mut cfg = Config::new();
cfg.url = Some(endpoints);
let pool = cfg
.create_pool(Some(Runtime::Tokio1), NoTls)
.context(CreatePostgresPoolSnafu)
.unwrap();
let schema_name = "test_auto_create_existing";
let table_name = "test_table";
// Manually create the schema first
let client = pool.get().await.unwrap();
let _ = client
.execute(
&format!("DROP SCHEMA IF EXISTS \"{}\" CASCADE", schema_name),
&[],
)
.await;
client
.execute(&format!("CREATE SCHEMA \"{}\"", schema_name), &[])
.await
.unwrap();
// Create store with auto_create_schema enabled (should succeed idempotently)
let _ = PgStore::with_pg_pool(pool.clone(), Some(schema_name), table_name, 128, true)
.await
.unwrap();
// Verify schema still exists
let row = client
.query_one(
"SELECT schema_name FROM information_schema.schemata WHERE schema_name = $1",
&[&schema_name],
)
.await
.unwrap();
let created_schema: String = row.get(0);
assert_eq!(created_schema, schema_name);
// Verify table was created in the schema
let row = client
.query_one(
"SELECT table_schema, table_name FROM information_schema.tables WHERE table_schema = $1 AND table_name = $2",
&[&schema_name, &table_name],
)
.await
.unwrap();
let created_table_schema: String = row.get(0);
let created_table_name: String = row.get(1);
assert_eq!(created_table_schema, schema_name);
assert_eq!(created_table_name, table_name);
// Cleanup
let _ = client
.execute(
&format!("DROP SCHEMA IF EXISTS \"{}\" CASCADE", schema_name),
&[],
)
.await;
}
#[tokio::test]
async fn test_auto_create_schema_no_schema_name() {
common_telemetry::init_default_ut_logging();
maybe_skip_postgres_integration_test!();
let endpoints = std::env::var("GT_POSTGRES_ENDPOINTS").unwrap();
let mut cfg = Config::new();
cfg.url = Some(endpoints);
let pool = cfg
.create_pool(Some(Runtime::Tokio1), NoTls)
.context(CreatePostgresPoolSnafu)
.unwrap();
let table_name = "test_table_no_schema";
// Create store with auto_create_schema enabled but no schema name (should succeed)
// This should create the table in the default schema (public)
let _ = PgStore::with_pg_pool(pool.clone(), None, table_name, 128, true)
.await
.unwrap();
// Verify table was created in public schema
let client = pool.get().await.unwrap();
let row = client
.query_one(
"SELECT table_schema, table_name FROM information_schema.tables WHERE table_name = $1",
&[&table_name],
)
.await
.unwrap();
let created_table_schema: String = row.get(0);
let created_table_name: String = row.get(1);
assert_eq!(created_table_name, table_name);
// Verify it's in public schema (or whichever is the default)
assert!(created_table_schema == "public" || !created_table_schema.is_empty());
// Cleanup
let _ = client
.execute(&format!("DROP TABLE IF EXISTS \"{}\"", table_name), &[])
.await;
}
#[tokio::test]
async fn test_auto_create_schema_with_empty_schema_name() {
common_telemetry::init_default_ut_logging();
maybe_skip_postgres_integration_test!();
let endpoints = std::env::var("GT_POSTGRES_ENDPOINTS").unwrap();
let mut cfg = Config::new();
cfg.url = Some(endpoints);
let pool = cfg
.create_pool(Some(Runtime::Tokio1), NoTls)
.context(CreatePostgresPoolSnafu)
.unwrap();
let table_name = "test_table_empty_schema";
// Create store with auto_create_schema enabled but empty schema name (should succeed)
// This should create the table in the default schema (public)
let _ = PgStore::with_pg_pool(pool.clone(), Some(""), table_name, 128, true)
.await
.unwrap();
// Verify table was created in public schema
let client = pool.get().await.unwrap();
let row = client
.query_one(
"SELECT table_schema, table_name FROM information_schema.tables WHERE table_name = $1",
&[&table_name],
)
.await
.unwrap();
let created_table_schema: String = row.get(0);
let created_table_name: String = row.get(1);
assert_eq!(created_table_name, table_name);
// Verify it's in public schema (or whichever is the default)
assert!(created_table_schema == "public" || !created_table_schema.is_empty());
// Cleanup
let _ = client
.execute(&format!("DROP TABLE IF EXISTS \"{}\"", table_name), &[])
.await;
}
}
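
Editor's note: with_pg_pool now takes an auto_create_schema flag and, when it is set together with a non-empty schema name, issues CREATE SCHEMA IF NOT EXISTS before creating the metadata table. Below is a minimal sketch of just that gating logic; the schema name is hypothetical.

fn create_schema_sql(schema_name: Option<&str>, auto_create_schema: bool) -> Option<String> {
    match schema_name {
        // Only emit the statement when the flag is on and a usable schema name is given.
        Some(schema) if auto_create_schema && !schema.is_empty() => {
            Some(format!("CREATE SCHEMA IF NOT EXISTS \"{}\"", schema))
        }
        _ => None,
    }
}

fn main() {
    assert_eq!(
        create_schema_sql(Some("meta_schema"), true).as_deref(),
        Some("CREATE SCHEMA IF NOT EXISTS \"meta_schema\"")
    );
    // Empty or missing schema names fall through to the connection's default schema.
    assert_eq!(create_schema_sql(Some(""), true), None);
    assert_eq!(create_schema_sql(None, true), None);
    // With the flag disabled the statement is never issued.
    assert_eq!(create_schema_sql(Some("meta_schema"), false), None);
}

Every other combination skips the statement, which is why the disabled test above expects table creation to fail when the schema does not already exist.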

View File

@@ -5,10 +5,12 @@ edition.workspace = true
license.workspace = true
[dependencies]
arrow-schema.workspace = true
common-base.workspace = true
common-decimal.workspace = true
common-error.workspace = true
common-macro.workspace = true
common-telemetry.workspace = true
common-time.workspace = true
datafusion-sql.workspace = true
datatypes.workspace = true

View File

@@ -14,11 +14,12 @@
use std::str::FromStr;
use arrow_schema::extension::ExtensionType;
use common_time::Timestamp;
use common_time::timezone::Timezone;
use datatypes::json::JsonStructureSettings;
use datatypes::extension::json::JsonExtensionType;
use datatypes::prelude::ConcreteDataType;
use datatypes::schema::ColumnDefaultConstraint;
use datatypes::schema::{ColumnDefaultConstraint, ColumnSchema};
use datatypes::types::{JsonFormat, parse_string_to_jsonb, parse_string_to_vector_type_value};
use datatypes::value::{OrderedF32, OrderedF64, Value};
use snafu::{OptionExt, ResultExt, ensure};
@@ -124,13 +125,14 @@ pub(crate) fn sql_number_to_value(data_type: &ConcreteDataType, n: &str) -> Resu
/// If `auto_string_to_numeric` is true, tries to cast the string value to numeric values,
/// and returns error if the cast fails.
pub fn sql_value_to_value(
column_name: &str,
data_type: &ConcreteDataType,
column_schema: &ColumnSchema,
sql_val: &SqlValue,
timezone: Option<&Timezone>,
unary_op: Option<UnaryOperator>,
auto_string_to_numeric: bool,
) -> Result<Value> {
let column_name = &column_schema.name;
let data_type = &column_schema.data_type;
let mut value = match sql_val {
SqlValue::Number(n, _) => sql_number_to_value(data_type, n)?,
SqlValue::Null => Value::Null,
@@ -146,13 +148,9 @@ pub fn sql_value_to_value(
(*b).into()
}
SqlValue::DoubleQuotedString(s) | SqlValue::SingleQuotedString(s) => parse_string_to_value(
column_name,
s.clone(),
data_type,
timezone,
auto_string_to_numeric,
)?,
SqlValue::DoubleQuotedString(s) | SqlValue::SingleQuotedString(s) => {
parse_string_to_value(column_schema, s.clone(), timezone, auto_string_to_numeric)?
}
SqlValue::HexStringLiteral(s) => {
// Should not directly write binary into json column
ensure!(
@@ -244,12 +242,12 @@ pub fn sql_value_to_value(
}
pub(crate) fn parse_string_to_value(
column_name: &str,
column_schema: &ColumnSchema,
s: String,
data_type: &ConcreteDataType,
timezone: Option<&Timezone>,
auto_string_to_numeric: bool,
) -> Result<Value> {
let data_type = &column_schema.data_type;
if auto_string_to_numeric && let Some(value) = auto_cast_to_numeric(&s, data_type)? {
return Ok(value);
}
@@ -257,7 +255,7 @@ pub(crate) fn parse_string_to_value(
ensure!(
data_type.is_stringifiable(),
ColumnTypeMismatchSnafu {
column_name,
column_name: column_schema.name.clone(),
expect: data_type.clone(),
actual: ConcreteDataType::string_datatype(),
}
@@ -303,23 +301,21 @@ pub(crate) fn parse_string_to_value(
}
}
ConcreteDataType::Binary(_) => Ok(Value::Binary(s.as_bytes().into())),
ConcreteDataType::Json(j) => {
match &j.format {
JsonFormat::Jsonb => {
let v = parse_string_to_jsonb(&s).context(DatatypeSnafu)?;
Ok(Value::Binary(v.into()))
}
JsonFormat::Native(_inner) => {
// Always use the structured version at this level.
let serde_json_value =
serde_json::from_str(&s).context(DeserializeSnafu { json: s })?;
let json_structure_settings = JsonStructureSettings::Structured(None);
json_structure_settings
.encode(serde_json_value)
.context(DatatypeSnafu)
}
ConcreteDataType::Json(j) => match &j.format {
JsonFormat::Jsonb => {
let v = parse_string_to_jsonb(&s).context(DatatypeSnafu)?;
Ok(Value::Binary(v.into()))
}
}
JsonFormat::Native(_) => {
let extension_type: Option<JsonExtensionType> =
column_schema.extension_type().context(DatatypeSnafu)?;
let json_structure_settings = extension_type
.and_then(|x| x.metadata().json_structure_settings.clone())
.unwrap_or_default();
let v = serde_json::from_str(&s).context(DeserializeSnafu { json: s })?;
json_structure_settings.encode(v).context(DatatypeSnafu)
}
},
ConcreteDataType::Vector(d) => {
let v = parse_string_to_vector_type_value(&s, Some(d.dim)).context(DatatypeSnafu)?;
Ok(Value::Binary(v.into()))
@@ -417,305 +413,265 @@ mod test {
use super::*;
macro_rules! call_parse_string_to_value {
($column_name: expr, $input: expr, $data_type: expr) => {
call_parse_string_to_value!($column_name, $input, $data_type, None)
};
($column_name: expr, $input: expr, $data_type: expr, timezone = $timezone: expr) => {
call_parse_string_to_value!($column_name, $input, $data_type, Some($timezone))
};
($column_name: expr, $input: expr, $data_type: expr, $timezone: expr) => {{
let column_schema = ColumnSchema::new($column_name, $data_type, true);
parse_string_to_value(&column_schema, $input, $timezone, true)
}};
}
#[test]
fn test_string_to_value_auto_numeric() {
fn test_string_to_value_auto_numeric() -> Result<()> {
// Test string to boolean with auto cast
let result = parse_string_to_value(
let result = call_parse_string_to_value!(
"col",
"true".to_string(),
&ConcreteDataType::boolean_datatype(),
None,
true,
)
.unwrap();
ConcreteDataType::boolean_datatype()
)?;
assert_eq!(Value::Boolean(true), result);
// Test invalid string to boolean with auto cast
let result = parse_string_to_value(
let result = call_parse_string_to_value!(
"col",
"not_a_boolean".to_string(),
&ConcreteDataType::boolean_datatype(),
None,
true,
ConcreteDataType::boolean_datatype()
);
assert!(result.is_err());
// Test string to int8
let result = parse_string_to_value(
let result = call_parse_string_to_value!(
"col",
"42".to_string(),
&ConcreteDataType::int8_datatype(),
None,
true,
)
.unwrap();
ConcreteDataType::int8_datatype()
)?;
assert_eq!(Value::Int8(42), result);
// Test invalid string to int8 with auto cast
let result = parse_string_to_value(
let result = call_parse_string_to_value!(
"col",
"not_an_int8".to_string(),
&ConcreteDataType::int8_datatype(),
None,
true,
ConcreteDataType::int8_datatype()
);
assert!(result.is_err());
// Test string to int16
let result = parse_string_to_value(
let result = call_parse_string_to_value!(
"col",
"1000".to_string(),
&ConcreteDataType::int16_datatype(),
None,
true,
)
.unwrap();
ConcreteDataType::int16_datatype()
)?;
assert_eq!(Value::Int16(1000), result);
// Test invalid string to int16 with auto cast
let result = parse_string_to_value(
let result = call_parse_string_to_value!(
"col",
"not_an_int16".to_string(),
&ConcreteDataType::int16_datatype(),
None,
true,
ConcreteDataType::int16_datatype()
);
assert!(result.is_err());
// Test string to int32
let result = parse_string_to_value(
let result = call_parse_string_to_value!(
"col",
"100000".to_string(),
&ConcreteDataType::int32_datatype(),
None,
true,
)
.unwrap();
ConcreteDataType::int32_datatype()
)?;
assert_eq!(Value::Int32(100000), result);
// Test invalid string to int32 with auto cast
let result = parse_string_to_value(
let result = call_parse_string_to_value!(
"col",
"not_an_int32".to_string(),
&ConcreteDataType::int32_datatype(),
None,
true,
ConcreteDataType::int32_datatype()
);
assert!(result.is_err());
// Test string to int64
let result = parse_string_to_value(
let result = call_parse_string_to_value!(
"col",
"1000000".to_string(),
&ConcreteDataType::int64_datatype(),
None,
true,
)
.unwrap();
ConcreteDataType::int64_datatype()
)?;
assert_eq!(Value::Int64(1000000), result);
// Test invalid string to int64 with auto cast
let result = parse_string_to_value(
let result = call_parse_string_to_value!(
"col",
"not_an_int64".to_string(),
&ConcreteDataType::int64_datatype(),
None,
true,
ConcreteDataType::int64_datatype()
);
assert!(result.is_err());
// Test string to uint8
let result = parse_string_to_value(
let result = call_parse_string_to_value!(
"col",
"200".to_string(),
&ConcreteDataType::uint8_datatype(),
None,
true,
)
.unwrap();
ConcreteDataType::uint8_datatype()
)?;
assert_eq!(Value::UInt8(200), result);
// Test invalid string to uint8 with auto cast
let result = parse_string_to_value(
let result = call_parse_string_to_value!(
"col",
"not_a_uint8".to_string(),
&ConcreteDataType::uint8_datatype(),
None,
true,
ConcreteDataType::uint8_datatype()
);
assert!(result.is_err());
// Test string to uint16
let result = parse_string_to_value(
let result = call_parse_string_to_value!(
"col",
"60000".to_string(),
&ConcreteDataType::uint16_datatype(),
None,
true,
)
.unwrap();
ConcreteDataType::uint16_datatype()
)?;
assert_eq!(Value::UInt16(60000), result);
// Test invalid string to uint16 with auto cast
let result = parse_string_to_value(
let result = call_parse_string_to_value!(
"col",
"not_a_uint16".to_string(),
&ConcreteDataType::uint16_datatype(),
None,
true,
ConcreteDataType::uint16_datatype()
);
assert!(result.is_err());
// Test string to uint32
let result = parse_string_to_value(
let result = call_parse_string_to_value!(
"col",
"4000000000".to_string(),
&ConcreteDataType::uint32_datatype(),
None,
true,
)
.unwrap();
ConcreteDataType::uint32_datatype()
)?;
assert_eq!(Value::UInt32(4000000000), result);
// Test invalid string to uint32 with auto cast
let result = parse_string_to_value(
let result = call_parse_string_to_value!(
"col",
"not_a_uint32".to_string(),
&ConcreteDataType::uint32_datatype(),
None,
true,
ConcreteDataType::uint32_datatype()
);
assert!(result.is_err());
// Test string to uint64
let result = parse_string_to_value(
let result = call_parse_string_to_value!(
"col",
"18446744073709551615".to_string(),
&ConcreteDataType::uint64_datatype(),
None,
true,
)
.unwrap();
ConcreteDataType::uint64_datatype()
)?;
assert_eq!(Value::UInt64(18446744073709551615), result);
// Test invalid string to uint64 with auto cast
let result = parse_string_to_value(
let result = call_parse_string_to_value!(
"col",
"not_a_uint64".to_string(),
&ConcreteDataType::uint64_datatype(),
None,
true,
ConcreteDataType::uint64_datatype()
);
assert!(result.is_err());
// Test string to float32
let result = parse_string_to_value(
let result = call_parse_string_to_value!(
"col",
"3.5".to_string(),
&ConcreteDataType::float32_datatype(),
None,
true,
)
.unwrap();
ConcreteDataType::float32_datatype()
)?;
assert_eq!(Value::Float32(OrderedF32::from(3.5)), result);
// Test invalid string to float32 with auto cast
let result = parse_string_to_value(
let result = call_parse_string_to_value!(
"col",
"not_a_float32".to_string(),
&ConcreteDataType::float32_datatype(),
None,
true,
ConcreteDataType::float32_datatype()
);
assert!(result.is_err());
// Test string to float64
let result = parse_string_to_value(
let result = call_parse_string_to_value!(
"col",
"3.5".to_string(),
&ConcreteDataType::float64_datatype(),
None,
true,
)
.unwrap();
ConcreteDataType::float64_datatype()
)?;
assert_eq!(Value::Float64(OrderedF64::from(3.5)), result);
// Test invalid string to float64 with auto cast
let result = parse_string_to_value(
let result = call_parse_string_to_value!(
"col",
"not_a_float64".to_string(),
&ConcreteDataType::float64_datatype(),
None,
true,
ConcreteDataType::float64_datatype()
);
assert!(result.is_err());
Ok(())
}
#[test]
fn test_sql_value_to_value() {
let sql_val = SqlValue::Null;
assert_eq!(
Value::Null,
sql_value_to_value(
"a",
&ConcreteDataType::float64_datatype(),
&sql_val,
None,
macro_rules! call_sql_value_to_value {
($column_name: expr, $data_type: expr, $sql_value: expr) => {
call_sql_value_to_value!($column_name, $data_type, $sql_value, None, None, false)
};
($column_name: expr, $data_type: expr, $sql_value: expr, timezone = $timezone: expr) => {
call_sql_value_to_value!(
$column_name,
$data_type,
$sql_value,
Some($timezone),
None,
false
)
.unwrap()
};
($column_name: expr, $data_type: expr, $sql_value: expr, unary_op = $unary_op: expr) => {
call_sql_value_to_value!(
$column_name,
$data_type,
$sql_value,
None,
Some($unary_op),
false
)
};
($column_name: expr, $data_type: expr, $sql_value: expr, auto_string_to_numeric) => {
call_sql_value_to_value!($column_name, $data_type, $sql_value, None, None, true)
};
($column_name: expr, $data_type: expr, $sql_value: expr, $timezone: expr, $unary_op: expr, $auto_string_to_numeric: expr) => {{
let column_schema = ColumnSchema::new($column_name, $data_type, true);
sql_value_to_value(
&column_schema,
$sql_value,
$timezone,
$unary_op,
$auto_string_to_numeric,
)
}};
}
#[test]
fn test_sql_value_to_value() -> Result<()> {
let sql_val = SqlValue::Null;
assert_eq!(
Value::Null,
call_sql_value_to_value!("a", ConcreteDataType::float64_datatype(), &sql_val)?
);
let sql_val = SqlValue::Boolean(true);
assert_eq!(
Value::Boolean(true),
sql_value_to_value(
"a",
&ConcreteDataType::boolean_datatype(),
&sql_val,
None,
None,
false
)
.unwrap()
call_sql_value_to_value!("a", ConcreteDataType::boolean_datatype(), &sql_val)?
);
let sql_val = SqlValue::Number("3.0".to_string(), false);
assert_eq!(
Value::Float64(OrderedFloat(3.0)),
sql_value_to_value(
"a",
&ConcreteDataType::float64_datatype(),
&sql_val,
None,
None,
false
)
.unwrap()
call_sql_value_to_value!("a", ConcreteDataType::float64_datatype(), &sql_val)?
);
let sql_val = SqlValue::Number("3.0".to_string(), false);
let v = sql_value_to_value(
"a",
&ConcreteDataType::boolean_datatype(),
&sql_val,
None,
None,
false,
);
let v = call_sql_value_to_value!("a", ConcreteDataType::boolean_datatype(), &sql_val);
assert!(v.is_err());
assert!(format!("{v:?}").contains("Failed to parse number '3.0' to boolean column type"));
let sql_val = SqlValue::Boolean(true);
let v = sql_value_to_value(
"a",
&ConcreteDataType::float64_datatype(),
&sql_val,
None,
None,
false,
);
let v = call_sql_value_to_value!("a", ConcreteDataType::float64_datatype(), &sql_val);
assert!(v.is_err());
assert!(
format!("{v:?}").contains(
@@ -725,41 +681,18 @@ mod test {
);
let sql_val = SqlValue::HexStringLiteral("48656c6c6f20776f726c6421".to_string());
let v = sql_value_to_value(
"a",
&ConcreteDataType::binary_datatype(),
&sql_val,
None,
None,
false,
)
.unwrap();
let v = call_sql_value_to_value!("a", ConcreteDataType::binary_datatype(), &sql_val)?;
assert_eq!(Value::Binary(Bytes::from(b"Hello world!".as_slice())), v);
let sql_val = SqlValue::DoubleQuotedString("MorningMyFriends".to_string());
let v = sql_value_to_value(
"a",
&ConcreteDataType::binary_datatype(),
&sql_val,
None,
None,
false,
)
.unwrap();
let v = call_sql_value_to_value!("a", ConcreteDataType::binary_datatype(), &sql_val)?;
assert_eq!(
Value::Binary(Bytes::from(b"MorningMyFriends".as_slice())),
v
);
let sql_val = SqlValue::HexStringLiteral("9AF".to_string());
let v = sql_value_to_value(
"a",
&ConcreteDataType::binary_datatype(),
&sql_val,
None,
None,
false,
);
let v = call_sql_value_to_value!("a", ConcreteDataType::binary_datatype(), &sql_val);
assert!(v.is_err());
assert!(
format!("{v:?}").contains("odd number of digits"),
@@ -767,38 +700,16 @@ mod test {
);
let sql_val = SqlValue::HexStringLiteral("AG".to_string());
let v = sql_value_to_value(
"a",
&ConcreteDataType::binary_datatype(),
&sql_val,
None,
None,
false,
);
let v = call_sql_value_to_value!("a", ConcreteDataType::binary_datatype(), &sql_val);
assert!(v.is_err());
assert!(format!("{v:?}").contains("invalid character"), "v is {v:?}",);
let sql_val = SqlValue::DoubleQuotedString("MorningMyFriends".to_string());
let v = sql_value_to_value(
"a",
&ConcreteDataType::json_datatype(),
&sql_val,
None,
None,
false,
);
let v = call_sql_value_to_value!("a", ConcreteDataType::json_datatype(), &sql_val);
assert!(v.is_err());
let sql_val = SqlValue::DoubleQuotedString(r#"{"a":"b"}"#.to_string());
let v = sql_value_to_value(
"a",
&ConcreteDataType::json_datatype(),
&sql_val,
None,
None,
false,
)
.unwrap();
let v = call_sql_value_to_value!("a", ConcreteDataType::json_datatype(), &sql_val)?;
assert_eq!(
Value::Binary(Bytes::from(
jsonb::parse_value(r#"{"a":"b"}"#.as_bytes())
@@ -808,16 +719,15 @@ mod test {
)),
v
);
Ok(())
}
#[test]
fn test_parse_json_to_jsonb() {
match parse_string_to_value(
match call_parse_string_to_value!(
"json_col",
r#"{"a": "b"}"#.to_string(),
&ConcreteDataType::json_datatype(),
None,
false,
ConcreteDataType::json_datatype()
) {
Ok(Value::Binary(b)) => {
assert_eq!(
@@ -833,12 +743,10 @@ mod test {
}
assert!(
parse_string_to_value(
call_parse_string_to_value!(
"json_col",
r#"Nicola Kovac is the best rifler in the world"#.to_string(),
&ConcreteDataType::json_datatype(),
None,
false,
ConcreteDataType::json_datatype()
)
.is_err()
)
@@ -878,13 +786,10 @@ mod test {
#[test]
fn test_parse_date_literal() {
let value = sql_value_to_value(
let value = call_sql_value_to_value!(
"date",
&ConcreteDataType::date_datatype(),
&SqlValue::DoubleQuotedString("2022-02-22".to_string()),
None,
None,
false,
ConcreteDataType::date_datatype(),
&SqlValue::DoubleQuotedString("2022-02-22".to_string())
)
.unwrap();
assert_eq!(ConcreteDataType::date_datatype(), value.data_type());
@@ -895,13 +800,11 @@ mod test {
}
// with timezone
let value = sql_value_to_value(
let value = call_sql_value_to_value!(
"date",
&ConcreteDataType::date_datatype(),
ConcreteDataType::date_datatype(),
&SqlValue::DoubleQuotedString("2022-02-22".to_string()),
Some(&Timezone::from_tz_string("+07:00").unwrap()),
None,
false,
timezone = &Timezone::from_tz_string("+07:00").unwrap()
)
.unwrap();
assert_eq!(ConcreteDataType::date_datatype(), value.data_type());
@@ -913,16 +816,12 @@ mod test {
}
#[test]
fn test_parse_timestamp_literal() {
match parse_string_to_value(
fn test_parse_timestamp_literal() -> Result<()> {
match call_parse_string_to_value!(
"timestamp_col",
"2022-02-22T00:01:01+08:00".to_string(),
&ConcreteDataType::timestamp_millisecond_datatype(),
None,
false,
)
.unwrap()
{
ConcreteDataType::timestamp_millisecond_datatype()
)? {
Value::Timestamp(ts) => {
assert_eq!(1645459261000, ts.value());
assert_eq!(TimeUnit::Millisecond, ts.unit());
@@ -932,15 +831,11 @@ mod test {
}
}
match parse_string_to_value(
match call_parse_string_to_value!(
"timestamp_col",
"2022-02-22T00:01:01+08:00".to_string(),
&ConcreteDataType::timestamp_datatype(TimeUnit::Second),
None,
false,
)
.unwrap()
{
ConcreteDataType::timestamp_datatype(TimeUnit::Second)
)? {
Value::Timestamp(ts) => {
assert_eq!(1645459261, ts.value());
assert_eq!(TimeUnit::Second, ts.unit());
@@ -950,15 +845,11 @@ mod test {
}
}
match parse_string_to_value(
match call_parse_string_to_value!(
"timestamp_col",
"2022-02-22T00:01:01+08:00".to_string(),
&ConcreteDataType::timestamp_datatype(TimeUnit::Microsecond),
None,
false,
)
.unwrap()
{
ConcreteDataType::timestamp_datatype(TimeUnit::Microsecond)
)? {
Value::Timestamp(ts) => {
assert_eq!(1645459261000000, ts.value());
assert_eq!(TimeUnit::Microsecond, ts.unit());
@@ -968,15 +859,11 @@ mod test {
}
}
match parse_string_to_value(
match call_parse_string_to_value!(
"timestamp_col",
"2022-02-22T00:01:01+08:00".to_string(),
&ConcreteDataType::timestamp_datatype(TimeUnit::Nanosecond),
None,
false,
)
.unwrap()
{
ConcreteDataType::timestamp_datatype(TimeUnit::Nanosecond)
)? {
Value::Timestamp(ts) => {
assert_eq!(1645459261000000000, ts.value());
assert_eq!(TimeUnit::Nanosecond, ts.unit());
@@ -987,26 +874,21 @@ mod test {
}
assert!(
parse_string_to_value(
call_parse_string_to_value!(
"timestamp_col",
"2022-02-22T00:01:01+08".to_string(),
&ConcreteDataType::timestamp_datatype(TimeUnit::Nanosecond),
None,
false,
ConcreteDataType::timestamp_datatype(TimeUnit::Nanosecond)
)
.is_err()
);
// with timezone
match parse_string_to_value(
match call_parse_string_to_value!(
"timestamp_col",
"2022-02-22T00:01:01".to_string(),
&ConcreteDataType::timestamp_datatype(TimeUnit::Nanosecond),
Some(&Timezone::from_tz_string("Asia/Shanghai").unwrap()),
false,
)
.unwrap()
{
ConcreteDataType::timestamp_datatype(TimeUnit::Nanosecond),
timezone = &Timezone::from_tz_string("Asia/Shanghai").unwrap()
)? {
Value::Timestamp(ts) => {
assert_eq!(1645459261000000000, ts.value());
assert_eq!("2022-02-21 16:01:01+0000", ts.to_iso8601_string());
@@ -1016,51 +898,42 @@ mod test {
unreachable!()
}
}
Ok(())
}
#[test]
fn test_parse_placeholder_value() {
assert!(
sql_value_to_value(
call_sql_value_to_value!(
"test",
&ConcreteDataType::string_datatype(),
ConcreteDataType::string_datatype(),
&SqlValue::Placeholder("default".into())
)
.is_err()
);
assert!(
call_sql_value_to_value!(
"test",
ConcreteDataType::string_datatype(),
&SqlValue::Placeholder("default".into()),
None,
None,
false
unary_op = UnaryOperator::Minus
)
.is_err()
);
assert!(
sql_value_to_value(
call_sql_value_to_value!(
"test",
&ConcreteDataType::string_datatype(),
&SqlValue::Placeholder("default".into()),
None,
Some(UnaryOperator::Minus),
false
)
.is_err()
);
assert!(
sql_value_to_value(
"test",
&ConcreteDataType::uint16_datatype(),
ConcreteDataType::uint16_datatype(),
&SqlValue::Number("3".into(), false),
None,
Some(UnaryOperator::Minus),
false
unary_op = UnaryOperator::Minus
)
.is_err()
);
assert!(
sql_value_to_value(
call_sql_value_to_value!(
"test",
&ConcreteDataType::uint16_datatype(),
&SqlValue::Number("3".into(), false),
None,
None,
false
ConcreteDataType::uint16_datatype(),
&SqlValue::Number("3".into(), false)
)
.is_ok()
);
@@ -1070,77 +943,60 @@ mod test {
fn test_auto_string_to_numeric() {
// Test with auto_string_to_numeric=true
let sql_val = SqlValue::SingleQuotedString("123".to_string());
let v = sql_value_to_value(
let v = call_sql_value_to_value!(
"a",
&ConcreteDataType::int32_datatype(),
ConcreteDataType::int32_datatype(),
&sql_val,
None,
None,
true,
auto_string_to_numeric
)
.unwrap();
assert_eq!(Value::Int32(123), v);
// Test with a float string
let sql_val = SqlValue::SingleQuotedString("3.5".to_string());
let v = sql_value_to_value(
let v = call_sql_value_to_value!(
"a",
&ConcreteDataType::float64_datatype(),
ConcreteDataType::float64_datatype(),
&sql_val,
None,
None,
true,
auto_string_to_numeric
)
.unwrap();
assert_eq!(Value::Float64(OrderedFloat(3.5)), v);
// Test with auto_string_to_numeric=false
let sql_val = SqlValue::SingleQuotedString("123".to_string());
let v = sql_value_to_value(
"a",
&ConcreteDataType::int32_datatype(),
&sql_val,
None,
None,
false,
);
let v = call_sql_value_to_value!("a", ConcreteDataType::int32_datatype(), &sql_val);
assert!(v.is_err());
// Test with an invalid numeric string but auto_string_to_numeric=true
// Should return an error now with the new auto_cast_to_numeric behavior
let sql_val = SqlValue::SingleQuotedString("not_a_number".to_string());
let v = sql_value_to_value(
let v = call_sql_value_to_value!(
"a",
&ConcreteDataType::int32_datatype(),
ConcreteDataType::int32_datatype(),
&sql_val,
None,
None,
true,
auto_string_to_numeric
);
assert!(v.is_err());
// Test with boolean type
let sql_val = SqlValue::SingleQuotedString("true".to_string());
let v = sql_value_to_value(
let v = call_sql_value_to_value!(
"a",
&ConcreteDataType::boolean_datatype(),
ConcreteDataType::boolean_datatype(),
&sql_val,
None,
None,
true,
auto_string_to_numeric
)
.unwrap();
assert_eq!(Value::Boolean(true), v);
// Non-numeric types should still be handled normally
let sql_val = SqlValue::SingleQuotedString("hello".to_string());
let v = sql_value_to_value(
let v = call_sql_value_to_value!(
"a",
&ConcreteDataType::string_datatype(),
ConcreteDataType::string_datatype(),
&sql_val,
None,
None,
true,
auto_string_to_numeric
);
assert!(v.is_ok());
}
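
Editor's note: the rewritten tests route every call through call_parse_string_to_value! / call_sql_value_to_value! macros whose optional timezone = .. and unary_op = .. arms forward to a single catch-all arm that builds the ColumnSchema. The toy macro below reproduces that shape in isolation; convert and its arguments are placeholders, not the real sql_value_to_value signature.

#[derive(Debug)]
enum Op {
    Minus,
}

// Placeholder target function; the real macros call parse_string_to_value / sql_value_to_value.
fn convert(name: &str, value: &str, tz: Option<&str>, op: Option<Op>, auto: bool) -> String {
    format!("{name}:{value}:{tz:?}:{op:?}:{auto}")
}

macro_rules! call_convert {
    ($name:expr, $value:expr) => {
        call_convert!($name, $value, None, None, false)
    };
    ($name:expr, $value:expr, timezone = $tz:expr) => {
        call_convert!($name, $value, Some($tz), None, false)
    };
    ($name:expr, $value:expr, unary_op = $op:expr) => {
        call_convert!($name, $value, None, Some($op), false)
    };
    // The catch-all arm is the only one that knows the full argument list.
    ($name:expr, $value:expr, $tz:expr, $op:expr, $auto:expr) => {
        convert($name, $value, $tz, $op, $auto)
    };
}

fn main() {
    assert_eq!(call_convert!("a", "1"), "a:1:None:None:false");
    assert_eq!(
        call_convert!("a", "1", timezone = "+07:00"),
        "a:1:Some(\"+07:00\"):None:false"
    );
    assert_eq!(
        call_convert!("a", "1", unary_op = Op::Minus),
        "a:1:None:Some(Minus):false"
    );
}

Because each shorthand arm re-invokes the macro, only the catch-all arm has to change if the underlying signature grows again.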

View File

@@ -14,8 +14,8 @@
use common_time::timezone::Timezone;
use datatypes::prelude::ConcreteDataType;
use datatypes::schema::ColumnDefaultConstraint;
use datatypes::schema::constraint::{CURRENT_TIMESTAMP, CURRENT_TIMESTAMP_FN};
use datatypes::schema::{ColumnDefaultConstraint, ColumnSchema};
use snafu::ensure;
use sqlparser::ast::ValueWithSpan;
pub use sqlparser::ast::{
@@ -47,9 +47,12 @@ pub fn parse_column_default_constraint(
);
let default_constraint = match &opt.option {
ColumnOption::Default(Expr::Value(v)) => ColumnDefaultConstraint::Value(
sql_value_to_value(column_name, data_type, &v.value, timezone, None, false)?,
),
ColumnOption::Default(Expr::Value(v)) => {
let schema = ColumnSchema::new(column_name, data_type.clone(), true);
ColumnDefaultConstraint::Value(sql_value_to_value(
&schema, &v.value, timezone, None, false,
)?)
}
ColumnOption::Default(Expr::Function(func)) => {
let mut func = format!("{func}").to_lowercase();
// normalize CURRENT_TIMESTAMP to CURRENT_TIMESTAMP()
@@ -80,8 +83,7 @@ pub fn parse_column_default_constraint(
if let Expr::Value(v) = &**expr {
let value = sql_value_to_value(
column_name,
data_type,
&ColumnSchema::new(column_name, data_type.clone(), true),
&v.value,
timezone,
Some(*op),

View File

@@ -24,6 +24,7 @@ use store_api::storage::GcReport;
mod close_region;
mod downgrade_region;
mod enter_staging;
mod file_ref;
mod flush_region;
mod gc_worker;
@@ -32,6 +33,7 @@ mod upgrade_region;
use crate::heartbeat::handler::close_region::CloseRegionsHandler;
use crate::heartbeat::handler::downgrade_region::DowngradeRegionsHandler;
use crate::heartbeat::handler::enter_staging::EnterStagingRegionsHandler;
use crate::heartbeat::handler::file_ref::GetFileRefsHandler;
use crate::heartbeat::handler::flush_region::FlushRegionsHandler;
use crate::heartbeat::handler::gc_worker::GcRegionsHandler;
@@ -123,6 +125,9 @@ impl RegionHeartbeatResponseHandler {
Instruction::GcRegions(_) => Ok(Some(Box::new(GcRegionsHandler.into()))),
Instruction::InvalidateCaches(_) => InvalidHeartbeatResponseSnafu.fail(),
Instruction::Suspend => Ok(None),
Instruction::EnterStagingRegions(_) => {
Ok(Some(Box::new(EnterStagingRegionsHandler.into())))
}
}
}
}
@@ -136,6 +141,7 @@ pub enum InstructionHandlers {
UpgradeRegions(UpgradeRegionsHandler),
GetFileRefs(GetFileRefsHandler),
GcRegions(GcRegionsHandler),
EnterStagingRegions(EnterStagingRegionsHandler),
}
macro_rules! impl_from_handler {
@@ -157,7 +163,8 @@ impl_from_handler!(
DowngradeRegionsHandler => DowngradeRegions,
UpgradeRegionsHandler => UpgradeRegions,
GetFileRefsHandler => GetFileRefs,
GcRegionsHandler => GcRegions
GcRegionsHandler => GcRegions,
EnterStagingRegionsHandler => EnterStagingRegions
);
macro_rules! dispatch_instr {
@@ -202,6 +209,7 @@ dispatch_instr!(
UpgradeRegions => UpgradeRegions,
GetFileRefs => GetFileRefs,
GcRegions => GcRegions,
EnterStagingRegions => EnterStagingRegions
);
#[async_trait]
@@ -254,7 +262,9 @@ mod tests {
use common_meta::heartbeat::mailbox::{
HeartbeatMailbox, IncomingMessage, MailboxRef, MessageMeta,
};
use common_meta::instruction::{DowngradeRegion, OpenRegion, UpgradeRegion};
use common_meta::instruction::{
DowngradeRegion, EnterStagingRegion, OpenRegion, UpgradeRegion,
};
use mito2::config::MitoConfig;
use mito2::engine::MITO_ENGINE_NAME;
use mito2::test_util::{CreateRequestBuilder, TestEnv};
@@ -335,6 +345,16 @@ mod tests {
region_id,
..Default::default()
}]);
assert!(
heartbeat_handler
.is_acceptable(&heartbeat_env.create_handler_ctx((meta.clone(), instruction)))
);
// Enter staging region
let instruction = Instruction::EnterStagingRegions(vec![EnterStagingRegion {
region_id,
partition_expr: "".to_string(),
}]);
assert!(
heartbeat_handler.is_acceptable(&heartbeat_env.create_handler_ctx((meta, instruction)))
);
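
Editor's note: registering the new handler is mostly mechanical because impl_from_handler! and dispatch_instr! are declarative macros that expand one Handler => Variant pair per line. The compact stand-alone sketch below shows the From-impl half of that pattern with placeholder handler types, not the datanode's real ones.

struct OpenHandler;
struct EnterStagingHandler;

enum Handlers {
    Open(OpenHandler),
    EnterStaging(EnterStagingHandler),
}

macro_rules! impl_from_handler {
    ($($handler:ty => $variant:ident),* $(,)?) => {
        $(
            impl From<$handler> for Handlers {
                fn from(handler: $handler) -> Self {
                    Handlers::$variant(handler)
                }
            }
        )*
    };
}

impl_from_handler!(
    OpenHandler => Open,
    EnterStagingHandler => EnterStaging,
);

fn main() {
    // Each concrete handler converts into the dispatch enum through a generated From impl.
    let open: Handlers = OpenHandler.into();
    let staging: Handlers = EnterStagingHandler.into();
    assert!(matches!(open, Handlers::Open(_)));
    assert!(matches!(staging, Handlers::EnterStaging(_)));
}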

View File

@@ -0,0 +1,243 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
use common_meta::instruction::{
EnterStagingRegion, EnterStagingRegionReply, EnterStagingRegionsReply, InstructionReply,
};
use common_telemetry::{error, warn};
use futures::future::join_all;
use store_api::region_request::{EnterStagingRequest, RegionRequest};
use crate::heartbeat::handler::{HandlerContext, InstructionHandler};
#[derive(Debug, Clone, Copy, Default)]
pub struct EnterStagingRegionsHandler;
#[async_trait::async_trait]
impl InstructionHandler for EnterStagingRegionsHandler {
type Instruction = Vec<EnterStagingRegion>;
async fn handle(
&self,
ctx: &HandlerContext,
enter_staging: Self::Instruction,
) -> Option<InstructionReply> {
let futures = enter_staging.into_iter().map(|enter_staging_region| {
Self::handle_enter_staging_region(ctx, enter_staging_region)
});
let results = join_all(futures).await;
Some(InstructionReply::EnterStagingRegions(
EnterStagingRegionsReply::new(results),
))
}
}
impl EnterStagingRegionsHandler {
async fn handle_enter_staging_region(
ctx: &HandlerContext,
EnterStagingRegion {
region_id,
partition_expr,
}: EnterStagingRegion,
) -> EnterStagingRegionReply {
let Some(writable) = ctx.region_server.is_region_leader(region_id) else {
warn!("Region: {} is not found", region_id);
return EnterStagingRegionReply {
region_id,
ready: false,
exists: false,
error: None,
};
};
if !writable {
warn!("Region: {} is not writable", region_id);
return EnterStagingRegionReply {
region_id,
ready: false,
exists: true,
error: Some("Region is not writable".into()),
};
}
match ctx
.region_server
.handle_request(
region_id,
RegionRequest::EnterStaging(EnterStagingRequest { partition_expr }),
)
.await
{
Ok(_) => EnterStagingRegionReply {
region_id,
ready: true,
exists: true,
error: None,
},
Err(err) => {
error!(err; "Failed to enter staging region");
EnterStagingRegionReply {
region_id,
ready: false,
exists: true,
error: Some(format!("{err:?}")),
}
}
}
}
}
#[cfg(test)]
mod tests {
use std::sync::Arc;
use common_meta::instruction::EnterStagingRegion;
use mito2::config::MitoConfig;
use mito2::engine::MITO_ENGINE_NAME;
use mito2::test_util::{CreateRequestBuilder, TestEnv};
use store_api::path_utils::table_dir;
use store_api::region_engine::RegionRole;
use store_api::region_request::RegionRequest;
use store_api::storage::RegionId;
use crate::heartbeat::handler::enter_staging::EnterStagingRegionsHandler;
use crate::heartbeat::handler::{HandlerContext, InstructionHandler};
use crate::region_server::RegionServer;
use crate::tests::{MockRegionEngine, mock_region_server};
const PARTITION_EXPR: &str = "partition_expr";
#[tokio::test]
async fn test_region_not_exist() {
let mut mock_region_server = mock_region_server();
let (mock_engine, _) = MockRegionEngine::new(MITO_ENGINE_NAME);
mock_region_server.register_engine(mock_engine);
let handler_context = HandlerContext::new_for_test(mock_region_server);
let region_id = RegionId::new(1024, 1);
let replies = EnterStagingRegionsHandler
.handle(
&handler_context,
vec![EnterStagingRegion {
region_id,
partition_expr: "".to_string(),
}],
)
.await
.unwrap();
let replies = replies.expect_enter_staging_regions_reply();
let reply = &replies[0];
assert!(!reply.exists);
assert!(reply.error.is_none());
assert!(!reply.ready);
}
#[tokio::test]
async fn test_region_not_writable() {
let mock_region_server = mock_region_server();
let region_id = RegionId::new(1024, 1);
let (mock_engine, _) =
MockRegionEngine::with_custom_apply_fn(MITO_ENGINE_NAME, |region_engine| {
region_engine.mock_role = Some(Some(RegionRole::Follower));
region_engine.handle_request_mock_fn = Some(Box::new(|_, _| Ok(0)));
});
mock_region_server.register_test_region(region_id, mock_engine);
let handler_context = HandlerContext::new_for_test(mock_region_server);
let replies = EnterStagingRegionsHandler
.handle(
&handler_context,
vec![EnterStagingRegion {
region_id,
partition_expr: "".to_string(),
}],
)
.await
.unwrap();
let replies = replies.expect_enter_staging_regions_reply();
let reply = &replies[0];
assert!(reply.exists);
assert!(reply.error.is_some());
assert!(!reply.ready);
}
async fn prepare_region(region_server: &RegionServer) {
let builder = CreateRequestBuilder::new();
let mut create_req = builder.build();
create_req.table_dir = table_dir("test", 1024);
let region_id = RegionId::new(1024, 1);
region_server
.handle_request(region_id, RegionRequest::Create(create_req))
.await
.unwrap();
}
#[tokio::test]
async fn test_enter_staging() {
let mut region_server = mock_region_server();
let region_id = RegionId::new(1024, 1);
let mut engine_env = TestEnv::new().await;
let engine = engine_env.create_engine(MitoConfig::default()).await;
region_server.register_engine(Arc::new(engine.clone()));
prepare_region(&region_server).await;
let handler_context = HandlerContext::new_for_test(region_server);
let replies = EnterStagingRegionsHandler
.handle(
&handler_context,
vec![EnterStagingRegion {
region_id,
partition_expr: PARTITION_EXPR.to_string(),
}],
)
.await
.unwrap();
let replies = replies.expect_enter_staging_regions_reply();
let reply = &replies[0];
assert!(reply.exists);
assert!(reply.error.is_none());
assert!(reply.ready);
// Should be ok to enter staging mode again with the same partition expr
let replies = EnterStagingRegionsHandler
.handle(
&handler_context,
vec![EnterStagingRegion {
region_id,
partition_expr: PARTITION_EXPR.to_string(),
}],
)
.await
.unwrap();
let replies = replies.expect_enter_staging_regions_reply();
let reply = &replies[0];
assert!(reply.exists);
assert!(reply.error.is_none());
assert!(reply.ready);
// Should throw error if try to enter staging mode again with a different partition expr
let replies = EnterStagingRegionsHandler
.handle(
&handler_context,
vec![EnterStagingRegion {
region_id,
partition_expr: "".to_string(),
}],
)
.await
.unwrap();
let replies = replies.expect_enter_staging_regions_reply();
let reply = &replies[0];
assert!(reply.exists);
assert!(reply.error.is_some());
assert!(!reply.ready);
}
}
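
Editor's note: the handler fans out over the regions with futures::future::join_all, so every region in the instruction produces a reply even when some regions are missing or not writable. A tiny stand-alone sketch of that fan-out shape, using stand-in types instead of the datanode's HandlerContext and reply structs:

use futures::future::join_all;

#[derive(Debug)]
struct Reply {
    region_id: u64,
    ready: bool,
    error: Option<String>,
}

async fn enter_staging_one(region_id: u64, partition_expr: String) -> Reply {
    // Pretend even region ids are not writable, purely to produce both reply shapes.
    if region_id % 2 == 1 {
        Reply { region_id, ready: true, error: None }
    } else {
        Reply {
            region_id,
            ready: false,
            error: Some(format!("region {region_id} rejected expr {partition_expr}")),
        }
    }
}

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let regions = vec![(1u64, "a < 10".to_string()), (2, "a >= 10".to_string())];
    let futures = regions
        .into_iter()
        .map(|(id, expr)| enter_staging_one(id, expr));
    // Every region yields a reply; failures are reported instead of aborting the batch.
    let replies: Vec<Reply> = join_all(futures).await;
    assert_eq!(replies.len(), 2);
    assert_eq!(replies[0].region_id, 1);
    assert!(replies.iter().any(|r| r.ready));
    assert!(replies.iter().any(|r| r.error.is_some()));
    println!("{replies:?}");
}

join_all preserves input order, so the i-th reply corresponds to the i-th EnterStagingRegion in the instruction.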

View File

@@ -66,7 +66,7 @@ use store_api::metric_engine_consts::{
};
use store_api::region_engine::{
RegionEngineRef, RegionManifestInfo, RegionRole, RegionStatistic, SetRegionRoleStateResponse,
SettableRegionRoleState,
SettableRegionRoleState, SyncRegionFromRequest,
};
use store_api::region_request::{
AffectedRows, BatchRegionDdlRequest, RegionCatchupRequest, RegionCloseRequest,
@@ -536,10 +536,13 @@ impl RegionServer {
let tracing_context = TracingContext::from_current_span();
let span = tracing_context.attach(info_span!("RegionServer::handle_sync_region_request"));
self.sync_region(region_id, manifest_info)
.trace(span)
.await
.map(|_| RegionResponse::new(AffectedRows::default()))
self.sync_region(
region_id,
SyncRegionFromRequest::from_manifest(manifest_info),
)
.trace(span)
.await
.map(|_| RegionResponse::new(AffectedRows::default()))
}
/// Handles the ListMetadata request and retrieves metadata for specified regions.
@@ -588,7 +591,7 @@ impl RegionServer {
pub async fn sync_region(
&self,
region_id: RegionId,
manifest_info: RegionManifestInfo,
request: SyncRegionFromRequest,
) -> Result<()> {
let engine_with_status = self
.inner
@@ -597,7 +600,7 @@ impl RegionServer {
.with_context(|| RegionNotFoundSnafu { region_id })?;
self.inner
.handle_sync_region(engine_with_status.engine(), region_id, manifest_info)
.handle_sync_region(engine_with_status.engine(), region_id, request)
.await
}
@@ -1216,7 +1219,8 @@ impl RegionServerInner {
| RegionRequest::Compact(_)
| RegionRequest::Truncate(_)
| RegionRequest::BuildIndex(_)
| RegionRequest::EnterStaging(_) => RegionChange::None,
| RegionRequest::EnterStaging(_)
| RegionRequest::ApplyStagingManifest(_) => RegionChange::None,
RegionRequest::Catchup(_) => RegionChange::Catchup,
};
@@ -1268,10 +1272,10 @@ impl RegionServerInner {
&self,
engine: &RegionEngineRef,
region_id: RegionId,
manifest_info: RegionManifestInfo,
request: SyncRegionFromRequest,
) -> Result<()> {
let Some(new_opened_regions) = engine
.sync_region(region_id, manifest_info)
.sync_region(region_id, request)
.await
.with_context(|_| HandleRegionRequestSnafu { region_id })?
.new_opened_logical_region_ids()

View File

@@ -33,9 +33,9 @@ use servers::grpc::FlightCompression;
use session::context::QueryContextRef;
use store_api::metadata::RegionMetadataRef;
use store_api::region_engine::{
CopyRegionFromRequest, CopyRegionFromResponse, RegionEngine, RegionManifestInfo, RegionRole,
RegionScannerRef, RegionStatistic, RemapManifestsRequest, RemapManifestsResponse,
SetRegionRoleStateResponse, SettableRegionRoleState, SyncManifestResponse,
RegionEngine, RegionRole, RegionScannerRef, RegionStatistic, RemapManifestsRequest,
RemapManifestsResponse, SetRegionRoleStateResponse, SettableRegionRoleState,
SyncRegionFromRequest, SyncRegionFromResponse,
};
use store_api::region_request::{AffectedRows, RegionRequest};
use store_api::storage::{RegionId, ScanRequest, SequenceNumber};
@@ -287,8 +287,8 @@ impl RegionEngine for MockRegionEngine {
async fn sync_region(
&self,
_region_id: RegionId,
_manifest_info: RegionManifestInfo,
) -> Result<SyncManifestResponse, BoxedError> {
_request: SyncRegionFromRequest,
) -> Result<SyncRegionFromResponse, BoxedError> {
unimplemented!()
}
@@ -299,14 +299,6 @@ impl RegionEngine for MockRegionEngine {
unimplemented!()
}
async fn copy_region_from(
&self,
_region_id: RegionId,
_request: CopyRegionFromRequest,
) -> Result<CopyRegionFromResponse, BoxedError> {
unimplemented!()
}
fn as_any(&self) -> &dyn Any {
self
}

View File

@@ -26,9 +26,9 @@ use std::sync::Arc;
use serde::{Deserialize, Serialize};
use serde_json::{Map, Value as Json};
use snafu::{ResultExt, ensure};
use snafu::{OptionExt, ResultExt, ensure};
use crate::error::{self, Error};
use crate::error::{self, InvalidJsonSnafu, Result, SerializeSnafu};
use crate::json::value::{JsonValue, JsonVariant};
use crate::types::json_type::{JsonNativeType, JsonNumberType, JsonObjectType};
use crate::types::{StructField, StructType};
@@ -71,7 +71,7 @@ impl JsonStructureSettings {
pub const RAW_FIELD: &'static str = "_raw";
/// Decode an encoded StructValue back into a serde_json::Value.
pub fn decode(&self, value: Value) -> Result<Json, Error> {
pub fn decode(&self, value: Value) -> Result<Json> {
let context = JsonContext {
key_path: String::new(),
settings: self,
@@ -82,7 +82,7 @@ impl JsonStructureSettings {
/// Decode a StructValue that was encoded with current settings back into a fully structured StructValue.
/// This is useful for reconstructing the original structure from encoded data, especially when
/// unstructured encoding was used for some fields.
pub fn decode_struct(&self, struct_value: StructValue) -> Result<StructValue, Error> {
pub fn decode_struct(&self, struct_value: StructValue) -> Result<StructValue> {
let context = JsonContext {
key_path: String::new(),
settings: self,
@@ -91,7 +91,11 @@ impl JsonStructureSettings {
}
/// Encode a serde_json::Value into a Value::Json using current settings.
pub fn encode(&self, json: Json) -> Result<Value, Error> {
pub fn encode(&self, json: Json) -> Result<Value> {
if let Some(json_struct) = self.json_struct() {
return encode_by_struct(json_struct, json);
}
let context = JsonContext {
key_path: String::new(),
settings: self,
@@ -104,13 +108,21 @@ impl JsonStructureSettings {
&self,
json: Json,
data_type: Option<&JsonNativeType>,
) -> Result<Value, Error> {
) -> Result<Value> {
let context = JsonContext {
key_path: String::new(),
settings: self,
};
encode_json_with_context(json, data_type, &context).map(|v| Value::Json(Box::new(v)))
}
fn json_struct(&self) -> Option<&StructType> {
match &self {
JsonStructureSettings::Structured(fields) => fields.as_ref(),
JsonStructureSettings::PartialUnstructuredByKey { fields, .. } => fields.as_ref(),
_ => None,
}
}
}
impl Default for JsonStructureSettings {
@@ -144,12 +156,54 @@ impl<'a> JsonContext<'a> {
}
}
fn encode_by_struct(json_struct: &StructType, mut json: Json) -> Result<Value> {
let Some(json_object) = json.as_object_mut() else {
return InvalidJsonSnafu {
value: "expect JSON object when struct is provided",
}
.fail();
};
let mut encoded = BTreeMap::new();
fn extract_field(json_object: &mut Map<String, Json>, field: &str) -> Result<Option<Json>> {
let (first, rest) = field.split_once('.').unwrap_or((field, ""));
if rest.is_empty() {
Ok(json_object.remove(first))
} else {
let Some(value) = json_object.get_mut(first) else {
return Ok(None);
};
let json_object = value.as_object_mut().with_context(|| InvalidJsonSnafu {
value: format!(r#"expect "{}" an object"#, first),
})?;
extract_field(json_object, rest)
}
}
let fields = json_struct.fields();
for field in fields.iter() {
let Some(field_value) = extract_field(json_object, field.name())? else {
continue;
};
let field_type: JsonNativeType = field.data_type().into();
let field_value = try_convert_to_expected_type(field_value, &field_type)?;
encoded.insert(field.name().to_string(), field_value);
}
let rest = serde_json::to_string(json_object).context(SerializeSnafu)?;
encoded.insert(JsonStructureSettings::RAW_FIELD.to_string(), rest.into());
let value: JsonValue = encoded.into();
Ok(Value::Json(Box::new(value)))
}
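As a standalone illustration of the dotted-path lookup inside extract_field above, here is a minimal runnable sketch; unlike the diff, it collapses the "intermediate value is not an object" case into None instead of returning an InvalidJsonSnafu error.
use serde_json::{Map, Value, json};
// Walk nested objects along a dotted path, removing the leaf so the remainder
// can later be serialized under the "_raw" field, as encode_by_struct does.
fn extract(obj: &mut Map<String, Value>, path: &str) -> Option<Value> {
    let (first, rest) = path.split_once('.').unwrap_or((path, ""));
    if rest.is_empty() {
        obj.remove(first)
    } else {
        obj.get_mut(first)?.as_object_mut().and_then(|o| extract(o, rest))
    }
}
fn main() {
    let mut doc = json!({"foo": {"i": 1, "j": 2}});
    let obj = doc.as_object_mut().unwrap();
    assert_eq!(extract(obj, "foo.i"), Some(json!(1)));
    // The extracted leaf is consumed; what remains is what ends up in "_raw".
    assert_eq!(doc, json!({"foo": {"j": 2}}));
}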
/// Main encoding function with key path tracking
pub fn encode_json_with_context<'a>(
json: Json,
data_type: Option<&JsonNativeType>,
context: &JsonContext<'a>,
) -> Result<JsonValue, Error> {
) -> Result<JsonValue> {
// Check if the entire encoding should be unstructured
if matches!(context.settings, JsonStructureSettings::UnstructuredRaw) {
let json_string = json.to_string();
@@ -215,7 +269,7 @@ fn encode_json_object_with_context<'a>(
mut json_object: Map<String, Json>,
fields: Option<&JsonObjectType>,
context: &JsonContext<'a>,
) -> Result<JsonValue, Error> {
) -> Result<JsonValue> {
let mut object = BTreeMap::new();
// First, process fields from the provided schema in their original order
if let Some(fields) = fields {
@@ -248,7 +302,7 @@ fn encode_json_array_with_context<'a>(
json_array: Vec<Json>,
item_type: Option<&JsonNativeType>,
context: &JsonContext<'a>,
) -> Result<JsonValue, Error> {
) -> Result<JsonValue> {
let json_array_len = json_array.len();
let mut items = Vec::with_capacity(json_array_len);
let mut element_type = item_type.cloned();
@@ -286,7 +340,7 @@ fn encode_json_value_with_context<'a>(
json: Json,
expected_type: Option<&JsonNativeType>,
context: &JsonContext<'a>,
) -> Result<JsonValue, Error> {
) -> Result<JsonValue> {
// Check if current key should be treated as unstructured
if context.is_unstructured_key() {
return Ok(json.to_string().into());
@@ -301,7 +355,7 @@ fn encode_json_value_with_context<'a>(
if let Some(expected) = expected_type
&& let Ok(value) = try_convert_to_expected_type(i, expected)
{
return Ok(value);
return Ok(value.into());
}
Ok(i.into())
} else if let Some(u) = n.as_u64() {
@@ -309,7 +363,7 @@ fn encode_json_value_with_context<'a>(
if let Some(expected) = expected_type
&& let Ok(value) = try_convert_to_expected_type(u, expected)
{
return Ok(value);
return Ok(value.into());
}
if u <= i64::MAX as u64 {
Ok((u as i64).into())
@@ -321,7 +375,7 @@ fn encode_json_value_with_context<'a>(
if let Some(expected) = expected_type
&& let Ok(value) = try_convert_to_expected_type(f, expected)
{
return Ok(value);
return Ok(value.into());
}
// Default to f64 for floating point numbers
@@ -335,7 +389,7 @@ fn encode_json_value_with_context<'a>(
if let Some(expected) = expected_type
&& let Ok(value) = try_convert_to_expected_type(s.as_str(), expected)
{
return Ok(value);
return Ok(value.into());
}
Ok(s.into())
}
@@ -345,10 +399,7 @@ fn encode_json_value_with_context<'a>(
}
/// Main decoding function with key path tracking
pub fn decode_value_with_context<'a>(
value: Value,
context: &JsonContext<'a>,
) -> Result<Json, Error> {
pub fn decode_value_with_context(value: Value, context: &JsonContext) -> Result<Json> {
// Check if the entire decoding should be unstructured
if matches!(context.settings, JsonStructureSettings::UnstructuredRaw) {
return decode_unstructured_value(value);
@@ -370,7 +421,7 @@ pub fn decode_value_with_context<'a>(
fn decode_struct_with_context<'a>(
struct_value: StructValue,
context: &JsonContext<'a>,
) -> Result<Json, Error> {
) -> Result<Json> {
let mut json_object = Map::with_capacity(struct_value.len());
let (items, fields) = struct_value.into_parts();
@@ -385,10 +436,7 @@ fn decode_struct_with_context<'a>(
}
/// Decode a list value to JSON array
fn decode_list_with_context<'a>(
list_value: ListValue,
context: &JsonContext<'a>,
) -> Result<Json, Error> {
fn decode_list_with_context(list_value: ListValue, context: &JsonContext) -> Result<Json> {
let mut json_array = Vec::with_capacity(list_value.len());
let data_items = list_value.take_items();
@@ -403,7 +451,7 @@ fn decode_list_with_context<'a>(
}
/// Decode unstructured value (stored as string)
fn decode_unstructured_value(value: Value) -> Result<Json, Error> {
fn decode_unstructured_value(value: Value) -> Result<Json> {
match value {
// Handle expected format: StructValue with single _raw field
Value::Struct(struct_value) => {
@@ -443,7 +491,7 @@ fn decode_unstructured_value(value: Value) -> Result<Json, Error> {
}
/// Decode primitive value to JSON
fn decode_primitive_value(value: Value) -> Result<Json, Error> {
fn decode_primitive_value(value: Value) -> Result<Json> {
match value {
Value::Null => Ok(Json::Null),
Value::Boolean(b) => Ok(Json::Bool(b)),
@@ -487,7 +535,7 @@ fn decode_primitive_value(value: Value) -> Result<Json, Error> {
fn decode_struct_with_settings<'a>(
struct_value: StructValue,
context: &JsonContext<'a>,
) -> Result<StructValue, Error> {
) -> Result<StructValue> {
// Check if we can return the struct directly (Structured case)
if matches!(context.settings, JsonStructureSettings::Structured(_)) {
return Ok(struct_value);
@@ -567,7 +615,7 @@ fn decode_struct_with_settings<'a>(
fn decode_list_with_settings<'a>(
list_value: ListValue,
context: &JsonContext<'a>,
) -> Result<ListValue, Error> {
) -> Result<ListValue> {
let mut items = Vec::with_capacity(list_value.len());
let (data_items, datatype) = list_value.into_parts();
@@ -592,7 +640,7 @@ fn decode_list_with_settings<'a>(
}
/// Helper function to decode a struct that was encoded with UnstructuredRaw settings
fn decode_unstructured_raw_struct(struct_value: StructValue) -> Result<StructValue, Error> {
fn decode_unstructured_raw_struct(struct_value: StructValue) -> Result<StructValue> {
// For UnstructuredRaw, the struct must have exactly one field named "_raw"
if struct_value.struct_type().fields().len() == 1 {
let field = &struct_value.struct_type().fields()[0];
@@ -636,12 +684,9 @@ fn decode_unstructured_raw_struct(struct_value: StructValue) -> Result<StructVal
}
/// Helper function to try converting a value to an expected type
fn try_convert_to_expected_type<T>(
value: T,
expected_type: &JsonNativeType,
) -> Result<JsonValue, Error>
fn try_convert_to_expected_type<T>(value: T, expected_type: &JsonNativeType) -> Result<JsonVariant>
where
T: Into<JsonValue>,
T: Into<JsonVariant>,
{
let value = value.into();
let cast_error = || {
@@ -650,7 +695,7 @@ where
}
.fail()
};
let actual_type = value.json_type().native_type();
let actual_type = &value.native_type();
match (actual_type, expected_type) {
(x, y) if x == y => Ok(value),
(JsonNativeType::Number(x), JsonNativeType::Number(y)) => match (x, y) {
@@ -691,6 +736,107 @@ mod tests {
use crate::data_type::ConcreteDataType;
use crate::types::ListType;
#[test]
fn test_encode_by_struct() {
let json_struct: StructType = [
StructField::new("s", ConcreteDataType::string_datatype(), true),
StructField::new("foo.i", ConcreteDataType::int64_datatype(), true),
StructField::new("x.y.z", ConcreteDataType::boolean_datatype(), true),
]
.into();
let json = json!({
"s": "hello",
"t": "world",
"foo": {
"i": 1,
"j": 2
},
"x": {
"y": {
"z": true
}
}
});
let value = encode_by_struct(&json_struct, json).unwrap();
assert_eq!(
value.to_string(),
r#"Json({ _raw: {"foo":{"j":2},"t":"world","x":{"y":{}}}, foo.i: 1, s: hello, x.y.z: true })"#
);
let json = json!({
"t": "world",
"foo": {
"i": 1,
"j": 2
},
"x": {
"y": {
"z": true
}
}
});
let value = encode_by_struct(&json_struct, json).unwrap();
assert_eq!(
value.to_string(),
r#"Json({ _raw: {"foo":{"j":2},"t":"world","x":{"y":{}}}, foo.i: 1, x.y.z: true })"#
);
let json = json!({
"s": 1234,
"foo": {
"i": 1,
"j": 2
},
"x": {
"y": {
"z": true
}
}
});
let value = encode_by_struct(&json_struct, json).unwrap();
assert_eq!(
value.to_string(),
r#"Json({ _raw: {"foo":{"j":2},"x":{"y":{}}}, foo.i: 1, s: 1234, x.y.z: true })"#
);
let json = json!({
"s": "hello",
"t": "world",
"foo": {
"i": "bar",
"j": 2
},
"x": {
"y": {
"z": true
}
}
});
let result = encode_by_struct(&json_struct, json);
assert_eq!(
result.unwrap_err().to_string(),
"Cannot cast value bar to Number(I64)"
);
let json = json!({
"s": "hello",
"t": "world",
"foo": {
"i": 1,
"j": 2
},
"x": {
"y": "z"
}
});
let result = encode_by_struct(&json_struct, json);
assert_eq!(
result.unwrap_err().to_string(),
r#"Invalid JSON: expect "y" an object"#
);
}
#[test]
fn test_encode_json_null() {
let json = Json::Null;

View File

@@ -82,6 +82,18 @@ impl From<f64> for JsonNumber {
}
}
impl From<Number> for JsonNumber {
fn from(n: Number) -> Self {
if let Some(i) = n.as_i64() {
i.into()
} else if let Some(i) = n.as_u64() {
i.into()
} else {
n.as_f64().unwrap_or(f64::NAN).into()
}
}
}
impl Display for JsonNumber {
fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
match self {
@@ -109,7 +121,28 @@ pub enum JsonVariant {
}
impl JsonVariant {
fn native_type(&self) -> JsonNativeType {
pub(crate) fn as_i64(&self) -> Option<i64> {
match self {
JsonVariant::Number(n) => n.as_i64(),
_ => None,
}
}
pub(crate) fn as_u64(&self) -> Option<u64> {
match self {
JsonVariant::Number(n) => n.as_u64(),
_ => None,
}
}
pub(crate) fn as_f64(&self) -> Option<f64> {
match self {
JsonVariant::Number(n) => Some(n.as_f64()),
_ => None,
}
}
pub(crate) fn native_type(&self) -> JsonNativeType {
match self {
JsonVariant::Null => JsonNativeType::Null,
JsonVariant::Bool(_) => JsonNativeType::Bool,
@@ -205,6 +238,32 @@ impl<K: Into<String>, V: Into<JsonVariant>, const N: usize> From<[(K, V); N]> fo
}
}
impl From<serde_json::Value> for JsonVariant {
fn from(v: serde_json::Value) -> Self {
fn helper(v: serde_json::Value) -> JsonVariant {
match v {
serde_json::Value::Null => JsonVariant::Null,
serde_json::Value::Bool(b) => b.into(),
serde_json::Value::Number(n) => n.into(),
serde_json::Value::String(s) => s.into(),
serde_json::Value::Array(array) => {
JsonVariant::Array(array.into_iter().map(helper).collect())
}
serde_json::Value::Object(object) => {
JsonVariant::Object(object.into_iter().map(|(k, v)| (k, helper(v))).collect())
}
}
}
helper(v)
}
}
impl From<BTreeMap<String, JsonVariant>> for JsonVariant {
fn from(v: BTreeMap<String, JsonVariant>) -> Self {
Self::Object(v)
}
}
impl Display for JsonVariant {
fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
match self {
@@ -277,24 +336,11 @@ impl JsonValue {
}
pub(crate) fn as_i64(&self) -> Option<i64> {
match self.json_variant {
JsonVariant::Number(n) => n.as_i64(),
_ => None,
}
self.json_variant.as_i64()
}
pub(crate) fn as_u64(&self) -> Option<u64> {
match self.json_variant {
JsonVariant::Number(n) => n.as_u64(),
_ => None,
}
}
pub(crate) fn as_f64(&self) -> Option<f64> {
match self.json_variant {
JsonVariant::Number(n) => Some(n.as_f64()),
_ => None,
}
self.json_variant.as_u64()
}
pub(crate) fn as_f64_lossy(&self) -> Option<f64> {

View File

@@ -122,9 +122,9 @@ pub struct StructField {
}
impl StructField {
pub fn new(name: String, data_type: ConcreteDataType, nullable: bool) -> Self {
pub fn new<T: Into<String>>(name: T, data_type: ConcreteDataType, nullable: bool) -> Self {
StructField {
name,
name: name.into(),
data_type,
nullable,
metadata: BTreeMap::new(),

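A hedged usage note for the relaxed bound above: any Into<String> now works as the field name, so the string-literal form used in the tests earlier in this diff compiles without an explicit String conversion.
// Both call styles are accepted after this change (illustrative only):
let by_str = StructField::new("host", ConcreteDataType::string_datatype(), true);
let by_string = StructField::new(String::from("host"), ConcreteDataType::string_datatype(), true);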
View File

@@ -26,10 +26,9 @@ use object_store::ObjectStore;
use snafu::{OptionExt, ensure};
use store_api::metadata::RegionMetadataRef;
use store_api::region_engine::{
CopyRegionFromRequest, CopyRegionFromResponse, RegionEngine, RegionManifestInfo, RegionRole,
RegionScannerRef, RegionStatistic, RemapManifestsRequest, RemapManifestsResponse,
SetRegionRoleStateResponse, SetRegionRoleStateSuccess, SettableRegionRoleState,
SinglePartitionScanner, SyncManifestResponse,
RegionEngine, RegionRole, RegionScannerRef, RegionStatistic, RemapManifestsRequest,
RemapManifestsResponse, SetRegionRoleStateResponse, SetRegionRoleStateSuccess,
SettableRegionRoleState, SinglePartitionScanner, SyncRegionFromRequest, SyncRegionFromResponse,
};
use store_api::region_request::{
AffectedRows, RegionCloseRequest, RegionCreateRequest, RegionDropRequest, RegionOpenRequest,
@@ -145,10 +144,10 @@ impl RegionEngine for FileRegionEngine {
async fn sync_region(
&self,
_region_id: RegionId,
_manifest_info: RegionManifestInfo,
) -> Result<SyncManifestResponse, BoxedError> {
_request: SyncRegionFromRequest,
) -> Result<SyncRegionFromResponse, BoxedError> {
// File engine doesn't need to sync region manifest.
Ok(SyncManifestResponse::NotSupported)
Ok(SyncRegionFromResponse::NotSupported)
}
async fn remap_manifests(
@@ -163,19 +162,6 @@ impl RegionEngine for FileRegionEngine {
))
}
async fn copy_region_from(
&self,
_region_id: RegionId,
_request: CopyRegionFromRequest,
) -> Result<CopyRegionFromResponse, BoxedError> {
Err(BoxedError::new(
UnsupportedSnafu {
operation: "copy_region_from",
}
.build(),
))
}
fn role(&self, region_id: RegionId) -> Option<RegionRole> {
self.inner.state(region_id)
}

View File

@@ -490,7 +490,6 @@ impl<'a> FlownodeServiceBuilder<'a> {
let config = GrpcServerConfig {
max_recv_message_size: opts.grpc.max_recv_message_size.as_bytes() as usize,
max_send_message_size: opts.grpc.max_send_message_size.as_bytes() as usize,
max_total_message_memory: opts.grpc.max_total_message_memory.as_bytes() as usize,
tls: opts.grpc.tls.clone(),
max_connection_age: opts.grpc.max_connection_age,
};

View File

@@ -32,6 +32,7 @@ common-frontend.workspace = true
common-function.workspace = true
common-grpc.workspace = true
common-macro.workspace = true
common-memory-manager.workspace = true
common-meta.workspace = true
common-options.workspace = true
common-procedure.workspace = true

View File

@@ -357,14 +357,6 @@ pub enum Error {
location: Location,
},
#[snafu(display("Failed to acquire more permits from limiter"))]
AcquireLimiter {
#[snafu(source)]
error: tokio::sync::AcquireError,
#[snafu(implicit)]
location: Location,
},
#[snafu(display("Service suspended"))]
Suspended {
#[snafu(implicit)]
@@ -449,8 +441,6 @@ impl ErrorExt for Error {
Error::StatementTimeout { .. } => StatusCode::Cancelled,
Error::AcquireLimiter { .. } => StatusCode::Internal,
Error::Suspended { .. } => StatusCode::Suspended,
}
}

View File

@@ -17,6 +17,7 @@ use std::sync::Arc;
use common_base::readable_size::ReadableSize;
use common_config::config::Configurable;
use common_event_recorder::EventRecorderOptions;
use common_memory_manager::OnExhaustedPolicy;
use common_options::datanode::DatanodeClientOptions;
use common_options::memory::MemoryOptions;
use common_telemetry::logging::{LoggingOptions, SlowQueryOptions, TracingOptions};
@@ -45,6 +46,12 @@ pub struct FrontendOptions {
pub default_timezone: Option<String>,
pub default_column_prefix: Option<String>,
pub heartbeat: HeartbeatOptions,
/// Maximum total memory for all concurrent write request bodies and messages (HTTP, gRPC, Flight).
/// Set to 0 to disable the limit. Default: "0" (unlimited)
pub max_in_flight_write_bytes: ReadableSize,
/// Policy when write bytes quota is exhausted.
/// Options: "wait" (default, 10s), "wait(<duration>)", "fail"
pub write_bytes_exhausted_policy: OnExhaustedPolicy,
pub http: HttpOptions,
pub grpc: GrpcOptions,
/// The internal gRPC options for the frontend service.
@@ -63,7 +70,6 @@ pub struct FrontendOptions {
pub user_provider: Option<String>,
pub tracing: TracingOptions,
pub query: QueryOptions,
pub max_in_flight_write_bytes: Option<ReadableSize>,
pub slow_query: SlowQueryOptions,
pub memory: MemoryOptions,
/// The event recorder options.
@@ -77,6 +83,8 @@ impl Default for FrontendOptions {
default_timezone: None,
default_column_prefix: None,
heartbeat: HeartbeatOptions::frontend_default(),
max_in_flight_write_bytes: ReadableSize(0),
write_bytes_exhausted_policy: OnExhaustedPolicy::default(),
http: HttpOptions::default(),
grpc: GrpcOptions::default(),
internal_grpc: None,
@@ -93,7 +101,6 @@ impl Default for FrontendOptions {
user_provider: None,
tracing: TracingOptions::default(),
query: QueryOptions::default(),
max_in_flight_write_bytes: None,
slow_query: SlowQueryOptions::default(),
memory: MemoryOptions::default(),
event_recorder: EventRecorderOptions::default(),

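A hedged sketch of turning the new write quota on in code, using only constructors that appear in this diff (the ReadableSize tuple struct and the Default impls); the concrete OnExhaustedPolicy variants are not shown here, so the default is kept.
let opts = FrontendOptions {
    // Cap total in-flight write bodies at 512 MiB; ReadableSize(0) keeps it unlimited.
    max_in_flight_write_bytes: ReadableSize(512 * 1024 * 1024),
    // Keep the default exhaustion policy ("wait", per the field docs above).
    write_bytes_exhausted_policy: OnExhaustedPolicy::default(),
    ..FrontendOptions::default()
};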
View File

@@ -97,7 +97,6 @@ use crate::error::{
ParseSqlSnafu, PermissionSnafu, PlanStatementSnafu, Result, SqlExecInterceptedSnafu,
StatementTimeoutSnafu, TableOperationSnafu,
};
use crate::limiter::LimiterRef;
use crate::stream_wrapper::CancellableStreamWrapper;
lazy_static! {
@@ -118,7 +117,6 @@ pub struct Instance {
deleter: DeleterRef,
table_metadata_manager: TableMetadataManagerRef,
event_recorder: Option<EventRecorderRef>,
limiter: Option<LimiterRef>,
process_manager: ProcessManagerRef,
slow_query_options: SlowQueryOptions,
suspend: Arc<AtomicBool>,

View File

@@ -49,7 +49,6 @@ use crate::events::EventHandlerImpl;
use crate::frontend::FrontendOptions;
use crate::instance::Instance;
use crate::instance::region_query::FrontendRegionQueryHandler;
use crate::limiter::Limiter;
/// The frontend [`Instance`] builder.
pub struct FrontendBuilder {
@@ -248,14 +247,6 @@ impl FrontendBuilder {
self.options.event_recorder.ttl,
))));
// Create the limiter if the max_in_flight_write_bytes is set.
let limiter = self
.options
.max_in_flight_write_bytes
.map(|max_in_flight_write_bytes| {
Arc::new(Limiter::new(max_in_flight_write_bytes.as_bytes() as usize))
});
Ok(Instance {
catalog_manager: self.catalog_manager,
pipeline_operator,
@@ -266,7 +257,6 @@ impl FrontendBuilder {
deleter,
table_metadata_manager: Arc::new(TableMetadataManager::new(kv_backend)),
event_recorder: Some(event_recorder),
limiter,
process_manager,
otlp_metrics_table_legacy_cache: DashMap::new(),
slow_query_options: self.options.slow_query.clone(),

View File

@@ -71,12 +71,6 @@ impl GrpcQueryHandler for Instance {
.check_permission(ctx.current_user(), PermissionReq::GrpcRequest(&request))
.context(PermissionSnafu)?;
let _guard = if let Some(limiter) = &self.limiter {
Some(limiter.limit_request(&request).await?)
} else {
None
};
let output = match request {
Request::Inserts(requests) => self.handle_inserts(requests, ctx.clone()).await?,
Request::RowInserts(requests) => {

View File

@@ -22,7 +22,7 @@ use common_error::ext::BoxedError;
use common_time::Timestamp;
use common_time::timestamp::TimeUnit;
use servers::error::{
AuthSnafu, CatalogSnafu, Error, OtherSnafu, TimestampOverflowSnafu, UnexpectedResultSnafu,
AuthSnafu, CatalogSnafu, Error, TimestampOverflowSnafu, UnexpectedResultSnafu,
};
use servers::influxdb::InfluxdbRequest;
use servers::interceptor::{LineProtocolInterceptor, LineProtocolInterceptorRef};
@@ -59,18 +59,6 @@ impl InfluxdbLineProtocolHandler for Instance {
.post_lines_conversion(requests, ctx.clone())
.await?;
let _guard = if let Some(limiter) = &self.limiter {
Some(
limiter
.limit_row_inserts(&requests)
.await
.map_err(BoxedError::new)
.context(OtherSnafu)?,
)
} else {
None
};
self.handle_influx_row_inserts(requests, ctx)
.await
.map_err(BoxedError::new)

View File

@@ -23,8 +23,7 @@ use datatypes::timestamp::TimestampNanosecond;
use pipeline::pipeline_operator::PipelineOperator;
use pipeline::{Pipeline, PipelineInfo, PipelineVersion};
use servers::error::{
AuthSnafu, Error as ServerError, ExecuteGrpcRequestSnafu, OtherSnafu, PipelineSnafu,
Result as ServerResult,
AuthSnafu, Error as ServerError, ExecuteGrpcRequestSnafu, PipelineSnafu, Result as ServerResult,
};
use servers::interceptor::{LogIngestInterceptor, LogIngestInterceptorRef};
use servers::query_handler::PipelineHandler;
@@ -124,18 +123,6 @@ impl Instance {
log: RowInsertRequests,
ctx: QueryContextRef,
) -> ServerResult<Output> {
let _guard = if let Some(limiter) = &self.limiter {
Some(
limiter
.limit_row_inserts(&log)
.await
.map_err(BoxedError::new)
.context(OtherSnafu)?,
)
} else {
None
};
self.inserter
.handle_log_inserts(log, ctx, self.statement_executor.as_ref())
.await
@@ -148,18 +135,6 @@ impl Instance {
rows: RowInsertRequests,
ctx: QueryContextRef,
) -> ServerResult<Output> {
let _guard = if let Some(limiter) = &self.limiter {
Some(
limiter
.limit_row_inserts(&rows)
.await
.map_err(BoxedError::new)
.context(OtherSnafu)?,
)
} else {
None
};
self.inserter
.handle_trace_inserts(rows, ctx, self.statement_executor.as_ref())
.await

View File

@@ -16,7 +16,7 @@ use async_trait::async_trait;
use auth::{PermissionChecker, PermissionCheckerRef, PermissionReq};
use common_error::ext::BoxedError;
use common_telemetry::tracing;
use servers::error::{self as server_error, AuthSnafu, ExecuteGrpcQuerySnafu, OtherSnafu};
use servers::error::{self as server_error, AuthSnafu, ExecuteGrpcQuerySnafu};
use servers::opentsdb::codec::DataPoint;
use servers::opentsdb::data_point_to_grpc_row_insert_requests;
use servers::query_handler::OpentsdbProtocolHandler;
@@ -41,18 +41,6 @@ impl OpentsdbProtocolHandler for Instance {
let (requests, _) = data_point_to_grpc_row_insert_requests(data_points)?;
let _guard = if let Some(limiter) = &self.limiter {
Some(
limiter
.limit_row_inserts(&requests)
.await
.map_err(BoxedError::new)
.context(OtherSnafu)?,
)
} else {
None
};
// OpenTSDB is single value.
let output = self
.handle_row_inserts(requests, ctx, true, true)

View File

@@ -24,7 +24,7 @@ use opentelemetry_proto::tonic::collector::logs::v1::ExportLogsServiceRequest;
use opentelemetry_proto::tonic::collector::trace::v1::ExportTraceServiceRequest;
use otel_arrow_rust::proto::opentelemetry::collector::metrics::v1::ExportMetricsServiceRequest;
use pipeline::{GreptimePipelineParams, PipelineWay};
use servers::error::{self, AuthSnafu, OtherSnafu, Result as ServerResult};
use servers::error::{self, AuthSnafu, Result as ServerResult};
use servers::http::prom_store::PHYSICAL_TABLE_PARAM;
use servers::interceptor::{OpenTelemetryProtocolInterceptor, OpenTelemetryProtocolInterceptorRef};
use servers::otlp;
@@ -83,18 +83,6 @@ impl OpenTelemetryProtocolHandler for Instance {
ctx
};
let _guard = if let Some(limiter) = &self.limiter {
Some(
limiter
.limit_row_inserts(&requests)
.await
.map_err(BoxedError::new)
.context(OtherSnafu)?,
)
} else {
None
};
// If the user uses the legacy path, it is by default without metric engine.
if metric_ctx.is_legacy || !metric_ctx.with_metric_engine {
self.handle_row_inserts(requests, ctx, false, false)
@@ -191,18 +179,6 @@ impl OpenTelemetryProtocolHandler for Instance {
)
.await?;
let _guard = if let Some(limiter) = &self.limiter {
Some(
limiter
.limit_ctx_req(&opt_req)
.await
.map_err(BoxedError::new)
.context(OtherSnafu)?,
)
} else {
None
};
let mut outputs = vec![];
for (temp_ctx, requests) in opt_req.as_req_iter(ctx) {

View File

@@ -175,18 +175,6 @@ impl PromStoreProtocolHandler for Instance {
.get::<PromStoreProtocolInterceptorRef<servers::error::Error>>();
interceptor_ref.pre_write(&request, ctx.clone())?;
let _guard = if let Some(limiter) = &self.limiter {
Some(
limiter
.limit_row_inserts(&request)
.await
.map_err(BoxedError::new)
.context(error::OtherSnafu)?,
)
} else {
None
};
let output = if with_metric_engine {
let physical_table = ctx
.extension(PHYSICAL_TABLE_PARAM)

View File

@@ -19,7 +19,6 @@ pub mod events;
pub mod frontend;
pub mod heartbeat;
pub mod instance;
pub(crate) mod limiter;
pub(crate) mod metrics;
pub mod server;
pub mod service_config;

View File

@@ -1,332 +0,0 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
use std::sync::Arc;
use api::v1::column::Values;
use api::v1::greptime_request::Request;
use api::v1::value::ValueData;
use api::v1::{
Decimal128, InsertRequests, IntervalMonthDayNano, JsonValue, RowInsertRequest,
RowInsertRequests, json_value,
};
use pipeline::ContextReq;
use snafu::ResultExt;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};
use crate::error::{AcquireLimiterSnafu, Result};
pub(crate) type LimiterRef = Arc<Limiter>;
/// A frontend request limiter that controls the total size of in-flight write
/// requests.
pub(crate) struct Limiter {
max_in_flight_write_bytes: usize,
byte_counter: Arc<Semaphore>,
}
impl Limiter {
pub fn new(max_in_flight_write_bytes: usize) -> Self {
Self {
byte_counter: Arc::new(Semaphore::new(max_in_flight_write_bytes)),
max_in_flight_write_bytes,
}
}
pub async fn limit_request(&self, request: &Request) -> Result<OwnedSemaphorePermit> {
let size = match request {
Request::Inserts(requests) => self.insert_requests_data_size(requests),
Request::RowInserts(requests) => {
self.rows_insert_requests_data_size(requests.inserts.iter())
}
_ => 0,
};
self.limit_in_flight_write_bytes(size).await
}
pub async fn limit_row_inserts(
&self,
requests: &RowInsertRequests,
) -> Result<OwnedSemaphorePermit> {
let size = self.rows_insert_requests_data_size(requests.inserts.iter());
self.limit_in_flight_write_bytes(size).await
}
pub async fn limit_ctx_req(&self, opt_req: &ContextReq) -> Result<OwnedSemaphorePermit> {
let size = self.rows_insert_requests_data_size(opt_req.ref_all_req());
self.limit_in_flight_write_bytes(size).await
}
/// Await until more inflight bytes are available
pub async fn limit_in_flight_write_bytes(&self, bytes: usize) -> Result<OwnedSemaphorePermit> {
self.byte_counter
.clone()
.acquire_many_owned(bytes as u32)
.await
.context(AcquireLimiterSnafu)
}
/// Returns the current in-flight write bytes.
#[allow(dead_code)]
pub fn in_flight_write_bytes(&self) -> usize {
self.max_in_flight_write_bytes - self.byte_counter.available_permits()
}
fn insert_requests_data_size(&self, request: &InsertRequests) -> usize {
let mut size: usize = 0;
for insert in &request.inserts {
for column in &insert.columns {
if let Some(values) = &column.values {
size += Self::size_of_column_values(values);
}
}
}
size
}
fn rows_insert_requests_data_size<'a>(
&self,
inserts: impl Iterator<Item = &'a RowInsertRequest>,
) -> usize {
let mut size: usize = 0;
for insert in inserts {
if let Some(rows) = &insert.rows {
for row in &rows.rows {
for value in &row.values {
if let Some(value) = &value.value_data {
size += Self::size_of_value_data(value);
}
}
}
}
}
size
}
fn size_of_column_values(values: &Values) -> usize {
let mut size: usize = 0;
size += values.i8_values.len() * size_of::<i32>();
size += values.i16_values.len() * size_of::<i32>();
size += values.i32_values.len() * size_of::<i32>();
size += values.i64_values.len() * size_of::<i64>();
size += values.u8_values.len() * size_of::<u32>();
size += values.u16_values.len() * size_of::<u32>();
size += values.u32_values.len() * size_of::<u32>();
size += values.u64_values.len() * size_of::<u64>();
size += values.f32_values.len() * size_of::<f32>();
size += values.f64_values.len() * size_of::<f64>();
size += values.bool_values.len() * size_of::<bool>();
size += values
.binary_values
.iter()
.map(|v| v.len() * size_of::<u8>())
.sum::<usize>();
size += values.string_values.iter().map(|v| v.len()).sum::<usize>();
size += values.date_values.len() * size_of::<i32>();
size += values.datetime_values.len() * size_of::<i64>();
size += values.timestamp_second_values.len() * size_of::<i64>();
size += values.timestamp_millisecond_values.len() * size_of::<i64>();
size += values.timestamp_microsecond_values.len() * size_of::<i64>();
size += values.timestamp_nanosecond_values.len() * size_of::<i64>();
size += values.time_second_values.len() * size_of::<i64>();
size += values.time_millisecond_values.len() * size_of::<i64>();
size += values.time_microsecond_values.len() * size_of::<i64>();
size += values.time_nanosecond_values.len() * size_of::<i64>();
size += values.interval_year_month_values.len() * size_of::<i64>();
size += values.interval_day_time_values.len() * size_of::<i64>();
size += values.interval_month_day_nano_values.len() * size_of::<IntervalMonthDayNano>();
size += values.decimal128_values.len() * size_of::<Decimal128>();
size += values
.list_values
.iter()
.map(|v| {
v.items
.iter()
.map(|item| {
item.value_data
.as_ref()
.map(Self::size_of_value_data)
.unwrap_or(0)
})
.sum::<usize>()
})
.sum::<usize>();
size += values
.struct_values
.iter()
.map(|v| {
v.items
.iter()
.map(|item| {
item.value_data
.as_ref()
.map(Self::size_of_value_data)
.unwrap_or(0)
})
.sum::<usize>()
})
.sum::<usize>();
size
}
fn size_of_value_data(value: &ValueData) -> usize {
match value {
ValueData::I8Value(_) => size_of::<i32>(),
ValueData::I16Value(_) => size_of::<i32>(),
ValueData::I32Value(_) => size_of::<i32>(),
ValueData::I64Value(_) => size_of::<i64>(),
ValueData::U8Value(_) => size_of::<u32>(),
ValueData::U16Value(_) => size_of::<u32>(),
ValueData::U32Value(_) => size_of::<u32>(),
ValueData::U64Value(_) => size_of::<u64>(),
ValueData::F32Value(_) => size_of::<f32>(),
ValueData::F64Value(_) => size_of::<f64>(),
ValueData::BoolValue(_) => size_of::<bool>(),
ValueData::BinaryValue(v) => v.len() * size_of::<u8>(),
ValueData::StringValue(v) => v.len(),
ValueData::DateValue(_) => size_of::<i32>(),
ValueData::DatetimeValue(_) => size_of::<i64>(),
ValueData::TimestampSecondValue(_) => size_of::<i64>(),
ValueData::TimestampMillisecondValue(_) => size_of::<i64>(),
ValueData::TimestampMicrosecondValue(_) => size_of::<i64>(),
ValueData::TimestampNanosecondValue(_) => size_of::<i64>(),
ValueData::TimeSecondValue(_) => size_of::<i64>(),
ValueData::TimeMillisecondValue(_) => size_of::<i64>(),
ValueData::TimeMicrosecondValue(_) => size_of::<i64>(),
ValueData::TimeNanosecondValue(_) => size_of::<i64>(),
ValueData::IntervalYearMonthValue(_) => size_of::<i32>(),
ValueData::IntervalDayTimeValue(_) => size_of::<i64>(),
ValueData::IntervalMonthDayNanoValue(_) => size_of::<IntervalMonthDayNano>(),
ValueData::Decimal128Value(_) => size_of::<Decimal128>(),
ValueData::ListValue(list_values) => list_values
.items
.iter()
.map(|item| {
item.value_data
.as_ref()
.map(Self::size_of_value_data)
.unwrap_or(0)
})
.sum(),
ValueData::StructValue(struct_values) => struct_values
.items
.iter()
.map(|item| {
item.value_data
.as_ref()
.map(Self::size_of_value_data)
.unwrap_or(0)
})
.sum(),
ValueData::JsonValue(v) => {
fn calc(v: &JsonValue) -> usize {
let Some(value) = v.value.as_ref() else {
return 0;
};
match value {
json_value::Value::Boolean(_) => size_of::<bool>(),
json_value::Value::Int(_) => size_of::<i64>(),
json_value::Value::Uint(_) => size_of::<u64>(),
json_value::Value::Float(_) => size_of::<f64>(),
json_value::Value::Str(s) => s.len(),
json_value::Value::Array(array) => array.items.iter().map(calc).sum(),
json_value::Value::Object(object) => object
.entries
.iter()
.flat_map(|entry| {
entry.value.as_ref().map(|v| entry.key.len() + calc(v))
})
.sum(),
}
}
calc(v)
}
}
}
}
#[cfg(test)]
mod tests {
use api::v1::column::Values;
use api::v1::greptime_request::Request;
use api::v1::{Column, InsertRequest};
use super::*;
fn generate_request(size: usize) -> Request {
let i8_values = vec![0; size / 4];
Request::Inserts(InsertRequests {
inserts: vec![InsertRequest {
columns: vec![Column {
values: Some(Values {
i8_values,
..Default::default()
}),
..Default::default()
}],
..Default::default()
}],
})
}
#[tokio::test]
async fn test_limiter() {
let limiter_ref: LimiterRef = Arc::new(Limiter::new(1024));
let tasks_count = 10;
let request_data_size = 100;
let mut handles = vec![];
// Generate multiple requests to test the limiter.
for _ in 0..tasks_count {
let limiter = limiter_ref.clone();
let handle = tokio::spawn(async move {
let result = limiter
.limit_request(&generate_request(request_data_size))
.await;
assert!(result.is_ok());
});
handles.push(handle);
}
// Wait for all threads to complete.
for handle in handles {
handle.await.unwrap();
}
}
#[tokio::test]
async fn test_in_flight_write_bytes() {
let limiter_ref: LimiterRef = Arc::new(Limiter::new(1024));
let req1 = generate_request(100);
let result1 = limiter_ref
.limit_request(&req1)
.await
.expect("failed to acquire permits");
assert_eq!(limiter_ref.in_flight_write_bytes(), 100);
let req2 = generate_request(200);
let result2 = limiter_ref
.limit_request(&req2)
.await
.expect("failed to acquire permits");
assert_eq!(limiter_ref.in_flight_write_bytes(), 300);
drop(result1);
assert_eq!(limiter_ref.in_flight_write_bytes(), 200);
drop(result2);
assert_eq!(limiter_ref.in_flight_write_bytes(), 0);
}
}

View File

@@ -40,6 +40,7 @@ use servers::otel_arrow::OtelArrowServiceHandler;
use servers::postgres::PostgresServer;
use servers::query_handler::grpc::ServerGrpcQueryHandlerAdapter;
use servers::query_handler::sql::ServerSqlQueryHandlerAdapter;
use servers::request_memory_limiter::ServerMemoryLimiter;
use servers::server::{Server, ServerHandlers};
use servers::tls::{ReloadableTlsServerConfig, maybe_watch_server_tls_config};
use snafu::ResultExt;
@@ -59,6 +60,7 @@ where
http_server_builder: Option<HttpServerBuilder>,
plugins: Plugins,
flight_handler: Option<FlightCraftRef>,
pub server_memory_limiter: ServerMemoryLimiter,
}
impl<T> Services<T>
@@ -66,6 +68,13 @@ where
T: Into<FrontendOptions> + Configurable + Clone,
{
pub fn new(opts: T, instance: Arc<Instance>, plugins: Plugins) -> Self {
let feopts = opts.clone().into();
// Create server request memory limiter for all server protocols
let server_memory_limiter = ServerMemoryLimiter::new(
feopts.max_in_flight_write_bytes.as_bytes(),
feopts.write_bytes_exhausted_policy,
);
Self {
opts,
instance,
@@ -73,18 +82,29 @@ where
http_server_builder: None,
plugins,
flight_handler: None,
server_memory_limiter,
}
}
pub fn grpc_server_builder(&self, opts: &GrpcOptions) -> Result<GrpcServerBuilder> {
pub fn grpc_server_builder(
&self,
opts: &GrpcOptions,
request_memory_limiter: ServerMemoryLimiter,
) -> Result<GrpcServerBuilder> {
let builder = GrpcServerBuilder::new(opts.as_config(), common_runtime::global_runtime())
.with_memory_limiter(request_memory_limiter)
.with_tls_config(opts.tls.clone())
.context(error::InvalidTlsConfigSnafu)?;
Ok(builder)
}
pub fn http_server_builder(&self, opts: &FrontendOptions) -> HttpServerBuilder {
pub fn http_server_builder(
&self,
opts: &FrontendOptions,
request_memory_limiter: ServerMemoryLimiter,
) -> HttpServerBuilder {
let mut builder = HttpServerBuilder::new(opts.http.clone())
.with_memory_limiter(request_memory_limiter)
.with_sql_handler(ServerSqlQueryHandlerAdapter::arc(self.instance.clone()));
let validator = self.plugins.get::<LogValidatorRef>();
@@ -169,11 +189,12 @@ where
meta_client: &Option<MetaClientOptions>,
name: Option<String>,
external: bool,
request_memory_limiter: ServerMemoryLimiter,
) -> Result<GrpcServer> {
let builder = if let Some(builder) = self.grpc_server_builder.take() {
builder
} else {
self.grpc_server_builder(grpc)?
self.grpc_server_builder(grpc, request_memory_limiter)?
};
let user_provider = if external {
@@ -235,11 +256,16 @@ where
Ok(grpc_server)
}
fn build_http_server(&mut self, opts: &FrontendOptions, toml: String) -> Result<HttpServer> {
fn build_http_server(
&mut self,
opts: &FrontendOptions,
toml: String,
request_memory_limiter: ServerMemoryLimiter,
) -> Result<HttpServer> {
let builder = if let Some(builder) = self.http_server_builder.take() {
builder
} else {
self.http_server_builder(opts)
self.http_server_builder(opts, request_memory_limiter)
};
let http_server = builder
@@ -264,7 +290,13 @@ where
{
// Always init GRPC server
let grpc_addr = parse_addr(&opts.grpc.bind_addr)?;
let grpc_server = self.build_grpc_server(&opts.grpc, &opts.meta_client, None, true)?;
let grpc_server = self.build_grpc_server(
&opts.grpc,
&opts.meta_client,
None,
true,
self.server_memory_limiter.clone(),
)?;
handlers.insert((Box::new(grpc_server), grpc_addr));
}
@@ -276,6 +308,7 @@ where
&opts.meta_client,
Some("INTERNAL_GRPC_SERVER".to_string()),
false,
self.server_memory_limiter.clone(),
)?;
handlers.insert((Box::new(grpc_server), grpc_addr));
}
@@ -284,7 +317,8 @@ where
// Always init HTTP server
let http_options = &opts.http;
let http_addr = parse_addr(&http_options.addr)?;
let http_server = self.build_http_server(&opts, toml)?;
let http_server =
self.build_http_server(&opts, toml, self.server_memory_limiter.clone())?;
handlers.insert((Box::new(http_server), http_addr));
}
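To recap the wiring spread across the hunks above, a hedged sketch: one ServerMemoryLimiter is created from the frontend options, and a clone of it is handed to every protocol server, so HTTP and both gRPC endpoints draw from a single shared quota. Signatures are the ones shown above; this is illustrative only.
let limiter = ServerMemoryLimiter::new(
    opts.max_in_flight_write_bytes.as_bytes(),
    opts.write_bytes_exhausted_policy,
);
let grpc_builder = services.grpc_server_builder(&opts.grpc, limiter.clone())?;
let http_builder = services.http_server_builder(&opts, limiter.clone());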

View File

@@ -339,6 +339,7 @@ pub async fn metasrv_builder(
opts.meta_schema_name.as_deref(),
&opts.meta_table_name,
opts.max_txn_ops,
opts.auto_create_schema,
)
.await
.context(error::KvBackendSnafu)?;

View File

@@ -17,6 +17,7 @@ use common_error::ext::{BoxedError, ErrorExt};
use common_error::status_code::StatusCode;
use common_macro::stack_trace_debug;
use common_meta::DatanodeId;
use common_procedure::ProcedureId;
use common_runtime::JoinError;
use snafu::{Location, Snafu};
use store_api::storage::RegionId;
@@ -768,6 +769,35 @@ pub enum Error {
location: Location,
},
#[snafu(display("Failed to create repartition subtasks"))]
RepartitionCreateSubtasks {
source: partition::error::Error,
#[snafu(implicit)]
location: Location,
},
#[snafu(display(
"Source partition expression '{}' does not match any existing region",
expr
))]
RepartitionSourceExprMismatch {
expr: String,
#[snafu(implicit)]
location: Location,
},
#[snafu(display(
"Failed to get the state receiver for repartition subprocedure {}",
procedure_id
))]
RepartitionSubprocedureStateReceiver {
procedure_id: ProcedureId,
#[snafu(source)]
source: common_procedure::Error,
#[snafu(implicit)]
location: Location,
},
#[snafu(display("Unsupported operation {}", operation))]
Unsupported {
operation: String,
@@ -1113,7 +1143,8 @@ impl ErrorExt for Error {
| Error::LeaderPeerChanged { .. }
| Error::RepartitionSourceRegionMissing { .. }
| Error::RepartitionTargetRegionMissing { .. }
| Error::PartitionExprMismatch { .. } => StatusCode::InvalidArguments,
| Error::PartitionExprMismatch { .. }
| Error::RepartitionSourceExprMismatch { .. } => StatusCode::InvalidArguments,
Error::LeaseKeyFromUtf8 { .. }
| Error::LeaseValueFromUtf8 { .. }
| Error::InvalidRegionKeyFromUtf8 { .. }
@@ -1173,6 +1204,8 @@ impl ErrorExt for Error {
Error::BuildTlsOptions { source, .. } => source.status_code(),
Error::Other { source, .. } => source.status_code(),
Error::RepartitionCreateSubtasks { source, .. } => source.status_code(),
Error::RepartitionSubprocedureStateReceiver { source, .. } => source.status_code(),
Error::NoEnoughAvailableNode { .. } => StatusCode::RuntimeResourcesExhausted,
#[cfg(feature = "pg_kvbackend")]

View File

@@ -194,7 +194,7 @@ impl SchedulerCtx for DefaultGcSchedulerCtx {
}
// Send GetFileRefs instructions to each datanode
let mut all_file_refs: HashMap<RegionId, HashSet<FileId>> = HashMap::new();
let mut all_file_refs: HashMap<RegionId, HashSet<_>> = HashMap::new();
let mut all_manifest_versions = HashMap::new();
for (peer, regions) in datanode2query_regions {

View File

@@ -53,6 +53,7 @@ pub fn new_empty_report_with(region_ids: impl IntoIterator<Item = RegionId>) ->
}
GcReport {
deleted_files,
deleted_indexes: HashMap::new(),
need_retry_regions: HashSet::new(),
}
}

View File

@@ -454,7 +454,11 @@ async fn test_region_gc_concurrency_with_retryable_errors() {
(
region_id,
// Mock the actual GC report with deleted files on success (even when there are no files to delete).

GcReport::new(HashMap::from([(region_id, vec![])]), HashSet::new()),
GcReport::new(
HashMap::from([(region_id, vec![])]),
Default::default(),
HashSet::new(),
),
)
})
.collect();

View File

@@ -20,7 +20,7 @@ use common_meta::datanode::RegionManifestInfo;
use common_meta::peer::Peer;
use common_telemetry::init_default_ut_logging;
use store_api::region_engine::RegionRole;
use store_api::storage::{FileId, FileRefsManifest, GcReport, RegionId};
use store_api::storage::{FileId, FileRef, FileRefsManifest, GcReport, RegionId};
use crate::gc::mock::{
MockSchedulerCtx, TEST_REGION_SIZE_200MB, mock_region_stat, new_empty_report_with,
@@ -60,7 +60,10 @@ async fn test_gc_regions_failure_handling() {
let file_refs = FileRefsManifest {
manifest_version: HashMap::from([(region_id, 1)]),
file_refs: HashMap::from([(region_id, HashSet::from([FileId::random()]))]),
file_refs: HashMap::from([(
region_id,
HashSet::from([FileRef::new(region_id, FileId::random(), None)]),
)]),
};
let ctx = Arc::new(

View File

@@ -356,8 +356,7 @@ impl BatchGcProcedure {
}
// Send GetFileRefs instructions to each datanode
let mut all_file_refs: HashMap<RegionId, HashSet<store_api::storage::FileId>> =
HashMap::new();
let mut all_file_refs: HashMap<RegionId, HashSet<_>> = HashMap::new();
let mut all_manifest_versions = HashMap::new();
for (peer, regions) in datanode2query_regions {

View File

@@ -163,8 +163,6 @@ pub struct MetasrvOptions {
pub backend_client: BackendClientOptions,
/// The type of selector.
pub selector: SelectorType,
/// Whether to use the memory store.
pub use_memory_store: bool,
/// Whether to enable region failover.
pub enable_region_failover: bool,
/// The base heartbeat interval.
@@ -233,6 +231,9 @@ pub struct MetasrvOptions {
#[cfg(feature = "pg_kvbackend")]
/// Optional PostgreSQL schema for metadata table (defaults to current search_path if empty).
pub meta_schema_name: Option<String>,
#[cfg(feature = "pg_kvbackend")]
/// Automatically create PostgreSQL schema if it doesn't exist (default: true).
pub auto_create_schema: bool,
#[serde(with = "humantime_serde")]
pub node_max_idle_time: Duration,
/// The event recorder options.
@@ -250,7 +251,6 @@ impl fmt::Debug for MetasrvOptions {
.field("store_addrs", &self.sanitize_store_addrs())
.field("backend_tls", &self.backend_tls)
.field("selector", &self.selector)
.field("use_memory_store", &self.use_memory_store)
.field("enable_region_failover", &self.enable_region_failover)
.field(
"allow_region_failover_on_local_wal",
@@ -301,7 +301,6 @@ impl Default for MetasrvOptions {
store_addrs: vec!["127.0.0.1:2379".to_string()],
backend_tls: None,
selector: SelectorType::default(),
use_memory_store: false,
enable_region_failover: false,
heartbeat_interval: distributed_time_constants::BASE_HEARTBEAT_INTERVAL,
region_failure_detector_initialization_delay: Duration::from_secs(10 * 60),
@@ -337,6 +336,8 @@ impl Default for MetasrvOptions {
meta_election_lock_id: common_meta::kv_backend::DEFAULT_META_ELECTION_LOCK_ID,
#[cfg(feature = "pg_kvbackend")]
meta_schema_name: None,
#[cfg(feature = "pg_kvbackend")]
auto_create_schema: true,
node_max_idle_time: Duration::from_secs(24 * 60 * 60),
event_recorder: EventRecorderOptions::default(),
stats_persistence: StatsPersistenceOptions::default(),

View File

@@ -12,8 +12,63 @@
// See the License for the specific language governing permissions and
// limitations under the License.
pub mod allocate_region;
pub mod collect;
pub mod deallocate_region;
pub mod dispatch;
pub mod group;
pub mod plan;
pub mod repartition_end;
pub mod repartition_start;
use std::any::Any;
use std::fmt::Debug;
use common_meta::cache_invalidator::CacheInvalidatorRef;
use common_meta::key::TableMetadataManagerRef;
use common_procedure::{Context as ProcedureContext, Status};
use serde::{Deserialize, Serialize};
use store_api::storage::TableId;
use crate::error::Result;
use crate::procedure::repartition::plan::RepartitionPlanEntry;
use crate::service::mailbox::MailboxRef;
#[cfg(test)]
pub mod test_util;
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
pub struct PersistentContext {
pub catalog_name: String,
pub schema_name: String,
pub table_name: String,
pub table_id: TableId,
pub plans: Vec<RepartitionPlanEntry>,
}
pub struct Context {
pub persistent_ctx: PersistentContext,
pub table_metadata_manager: TableMetadataManagerRef,
pub mailbox: MailboxRef,
pub server_addr: String,
pub cache_invalidator: CacheInvalidatorRef,
}
#[async_trait::async_trait]
#[typetag::serde(tag = "repartition_state")]
pub(crate) trait State: Sync + Send + Debug {
fn name(&self) -> &'static str {
let type_name = std::any::type_name::<Self>();
// short name
type_name.split("::").last().unwrap_or(type_name)
}
/// Yields the next [State] and [Status].
async fn next(
&mut self,
ctx: &mut Context,
procedure_ctx: &ProcedureContext,
) -> Result<(Box<dyn State>, Status)>;
fn as_any(&self) -> &dyn Any;
}
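For readers new to this pattern, a hedged sketch of how a parent procedure might drive these states; the real executor lives in common_procedure, so this is illustrative only.
// Advance the machine one transition, e.g. AllocateRegion -> Dispatch -> Collect.
async fn step(
    state: &mut Box<dyn State>,
    ctx: &mut Context,
    procedure_ctx: &ProcedureContext,
) -> Result<Status> {
    let (next, status) = state.next(ctx, procedure_ctx).await?;
    *state = next;
    Ok(status)
}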

View File

@@ -0,0 +1,67 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
use std::any::Any;
use common_procedure::{Context as ProcedureContext, Status};
use serde::{Deserialize, Serialize};
use crate::error::Result;
use crate::procedure::repartition::dispatch::Dispatch;
use crate::procedure::repartition::plan::{AllocationPlanEntry, RepartitionPlanEntry};
use crate::procedure::repartition::{Context, State};
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct AllocateRegion {
plan_entries: Vec<AllocationPlanEntry>,
}
impl AllocateRegion {
pub fn new(plan_entries: Vec<AllocationPlanEntry>) -> Self {
Self { plan_entries }
}
}
#[async_trait::async_trait]
#[typetag::serde]
impl State for AllocateRegion {
async fn next(
&mut self,
ctx: &mut Context,
_procedure_ctx: &ProcedureContext,
) -> Result<(Box<dyn State>, Status)> {
let region_to_allocate = self
.plan_entries
.iter()
.map(|p| p.regions_to_allocate)
.sum::<usize>();
if region_to_allocate == 0 {
let repartition_plan_entries = self
.plan_entries
.iter()
.map(RepartitionPlanEntry::from_allocation_plan_entry)
.collect::<Vec<_>>();
ctx.persistent_ctx.plans = repartition_plan_entries;
return Ok((Box::new(Dispatch), Status::executing(true)));
}
// TODO(weny): allocate regions.
todo!()
}
fn as_any(&self) -> &dyn Any {
self
}
}

View File

@@ -0,0 +1,106 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
use std::any::Any;
use common_procedure::{Context as ProcedureContext, ProcedureId, Status, watcher};
use common_telemetry::error;
use serde::{Deserialize, Serialize};
use snafu::ResultExt;
use crate::error::{RepartitionSubprocedureStateReceiverSnafu, Result};
use crate::procedure::repartition::deallocate_region::DeallocateRegion;
use crate::procedure::repartition::group::GroupId;
use crate::procedure::repartition::{Context, State};
/// Metadata for tracking a dispatched sub-procedure.
#[derive(Debug, Clone, Copy, Serialize, Deserialize, PartialEq)]
pub struct ProcedureMeta {
/// The index of the plan entry in the parent procedure's plan list.
pub plan_index: usize,
/// The group id of the repartition group.
pub group_id: GroupId,
/// The procedure id of the sub-procedure.
pub procedure_id: ProcedureId,
}
/// State for collecting results from dispatched sub-procedures.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Collect {
/// Sub-procedures that are currently in-flight.
pub inflight_procedures: Vec<ProcedureMeta>,
/// Sub-procedures that have completed successfully.
pub succeeded_procedures: Vec<ProcedureMeta>,
/// Sub-procedures that have failed.
pub failed_procedures: Vec<ProcedureMeta>,
/// Sub-procedures whose state could not be determined.
pub unknown_procedures: Vec<ProcedureMeta>,
}
impl Collect {
pub fn new(inflight_procedures: Vec<ProcedureMeta>) -> Self {
Self {
inflight_procedures,
succeeded_procedures: Vec::new(),
failed_procedures: Vec::new(),
unknown_procedures: Vec::new(),
}
}
}
#[async_trait::async_trait]
#[typetag::serde]
impl State for Collect {
async fn next(
&mut self,
_ctx: &mut Context,
procedure_ctx: &ProcedureContext,
) -> Result<(Box<dyn State>, Status)> {
for procedure_meta in self.inflight_procedures.iter() {
let procedure_id = procedure_meta.procedure_id;
let group_id = procedure_meta.group_id;
let Some(mut receiver) = procedure_ctx
.provider
.procedure_state_receiver(procedure_id)
.await
.context(RepartitionSubprocedureStateReceiverSnafu { procedure_id })?
else {
error!(
"failed to get procedure state receiver, procedure_id: {}, group_id: {}",
procedure_id, group_id
);
self.unknown_procedures.push(*procedure_meta);
continue;
};
match watcher::wait(&mut receiver).await {
Ok(_) => self.succeeded_procedures.push(*procedure_meta),
Err(e) => {
error!(e; "failed to wait for repartition subprocedure, procedure_id: {}, group_id: {}", procedure_id, group_id);
self.failed_procedures.push(*procedure_meta);
}
}
}
if !self.failed_procedures.is_empty() || !self.unknown_procedures.is_empty() {
// TODO(weny): retry the failed or unknown procedures.
}
Ok((Box::new(DeallocateRegion), Status::executing(true)))
}
fn as_any(&self) -> &dyn Any {
self
}
}

View File

@@ -0,0 +1,52 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
use std::any::Any;
use common_procedure::{Context as ProcedureContext, Status};
use serde::{Deserialize, Serialize};
use crate::error::Result;
use crate::procedure::repartition::repartition_end::RepartitionEnd;
use crate::procedure::repartition::{Context, State};
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct DeallocateRegion;
#[async_trait::async_trait]
#[typetag::serde]
impl State for DeallocateRegion {
async fn next(
&mut self,
ctx: &mut Context,
_procedure_ctx: &ProcedureContext,
) -> Result<(Box<dyn State>, Status)> {
let region_to_deallocate = ctx
.persistent_ctx
.plans
.iter()
.map(|p| p.pending_deallocate_region_ids.len())
.sum::<usize>();
if region_to_deallocate == 0 {
return Ok((Box::new(RepartitionEnd), Status::done()));
}
// TODO(weny): deallocate regions.
todo!()
}
fn as_any(&self) -> &dyn Any {
self
}
}

View File

@@ -0,0 +1,66 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
use std::any::Any;
use common_procedure::{Context as ProcedureContext, ProcedureWithId, Status};
use serde::{Deserialize, Serialize};
use crate::error::Result;
use crate::procedure::repartition::collect::{Collect, ProcedureMeta};
use crate::procedure::repartition::group::RepartitionGroupProcedure;
use crate::procedure::repartition::{self, Context, State};
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Dispatch;
#[async_trait::async_trait]
#[typetag::serde]
impl State for Dispatch {
async fn next(
&mut self,
ctx: &mut Context,
_procedure_ctx: &ProcedureContext,
) -> Result<(Box<dyn State>, Status)> {
let table_id = ctx.persistent_ctx.table_id;
let mut procedures = Vec::with_capacity(ctx.persistent_ctx.plans.len());
let mut procedure_metas = Vec::with_capacity(ctx.persistent_ctx.plans.len());
for (plan_index, plan) in ctx.persistent_ctx.plans.iter().enumerate() {
let persistent_ctx = repartition::group::PersistentContext::new(
plan.group_id,
table_id,
plan.source_regions.clone(),
plan.target_regions.clone(),
);
let group_procedure = RepartitionGroupProcedure::new(persistent_ctx, ctx);
let procedure = ProcedureWithId::with_random_id(Box::new(group_procedure));
procedure_metas.push(ProcedureMeta {
plan_index,
group_id: plan.group_id,
procedure_id: procedure.id,
});
procedures.push(procedure);
}
Ok((
Box::new(Collect::new(procedure_metas)),
Status::suspended(procedures, true),
))
}
fn as_any(&self) -> &dyn Any {
self
}
}

View File

@@ -12,11 +12,14 @@
// See the License for the specific language governing permissions and
// limitations under the License.
pub(crate) mod enter_staging_region;
pub(crate) mod repartition_start;
pub(crate) mod update_metadata;
pub(crate) mod utils;
use std::any::Any;
use std::fmt::Debug;
use std::time::Duration;
use common_error::ext::BoxedError;
use common_meta::DatanodeId;
@@ -26,18 +29,78 @@ use common_meta::key::datanode_table::{DatanodeTableKey, DatanodeTableValue, Reg
use common_meta::key::table_route::TableRouteValue;
use common_meta::key::{DeserializedValueWithBytes, TableMetadataManagerRef};
use common_meta::rpc::router::RegionRoute;
use common_procedure::{Context as ProcedureContext, Status};
use common_procedure::{
Context as ProcedureContext, LockKey, Procedure, Result as ProcedureResult, Status,
UserMetadata,
};
use serde::{Deserialize, Serialize};
use snafu::{OptionExt, ResultExt};
use store_api::storage::{RegionId, TableId};
use uuid::Uuid;
use crate::error::{self, Result};
use crate::procedure::repartition::group::repartition_start::RepartitionStart;
use crate::procedure::repartition::plan::RegionDescriptor;
use crate::procedure::repartition::{self};
use crate::service::mailbox::MailboxRef;
pub type GroupId = Uuid;
pub struct RepartitionGroupProcedure {}
#[allow(dead_code)]
pub struct RepartitionGroupProcedure {
state: Box<dyn State>,
context: Context,
}
impl RepartitionGroupProcedure {
const TYPE_NAME: &'static str = "metasrv-procedure::RepartitionGroup";
pub fn new(persistent_context: PersistentContext, context: &repartition::Context) -> Self {
let state = Box::new(RepartitionStart);
Self {
state,
context: Context {
persistent_ctx: persistent_context,
cache_invalidator: context.cache_invalidator.clone(),
table_metadata_manager: context.table_metadata_manager.clone(),
mailbox: context.mailbox.clone(),
server_addr: context.server_addr.clone(),
},
}
}
}
#[async_trait::async_trait]
impl Procedure for RepartitionGroupProcedure {
fn type_name(&self) -> &str {
Self::TYPE_NAME
}
async fn execute(&mut self, _ctx: &ProcedureContext) -> ProcedureResult<Status> {
todo!()
}
async fn rollback(&mut self, _: &ProcedureContext) -> ProcedureResult<()> {
todo!()
}
fn rollback_supported(&self) -> bool {
true
}
fn dump(&self) -> ProcedureResult<String> {
todo!()
}
fn lock_key(&self) -> LockKey {
todo!()
}
fn user_metadata(&self) -> Option<UserMetadata> {
todo!()
}
}
pub struct Context {
pub persistent_ctx: PersistentContext,
@@ -45,13 +108,22 @@ pub struct Context {
pub cache_invalidator: CacheInvalidatorRef,
pub table_metadata_manager: TableMetadataManagerRef,
pub mailbox: MailboxRef,
pub server_addr: String,
}
/// The result of the group preparation phase, containing validated region routes.
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
pub struct GroupPrepareResult {
/// The validated source region routes.
pub source_routes: Vec<RegionRoute>,
/// The validated target region routes.
pub target_routes: Vec<RegionRoute>,
/// The primary source region id (first source region), used for retrieving region options.
pub central_region: RegionId,
/// The datanode id where the primary source region is located.
pub central_region_datanode_id: DatanodeId,
}
@@ -69,6 +141,23 @@ pub struct PersistentContext {
pub group_prepare_result: Option<GroupPrepareResult>,
}
impl PersistentContext {
pub fn new(
group_id: GroupId,
table_id: TableId,
sources: Vec<RegionDescriptor>,
targets: Vec<RegionDescriptor>,
) -> Self {
Self {
group_id,
table_id,
sources,
targets,
group_prepare_result: None,
}
}
}
impl Context {
/// Retrieves the table route value for the given table id.
///
@@ -184,6 +273,13 @@ impl Context {
.await
.context(error::TableMetadataManagerSnafu)
}
/// Returns the next operation timeout.
///
/// If the next operation timeout is not set, it will return `None`.
pub fn next_operation_timeout(&self) -> Option<Duration> {
Some(Duration::from_secs(10))
}
}
/// Returns the region routes of the given table route value.

View File

@@ -0,0 +1,717 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
use std::any::Any;
use std::collections::HashMap;
use std::time::{Duration, Instant};
use api::v1::meta::MailboxMessage;
use common_meta::instruction::{
EnterStagingRegionReply, EnterStagingRegionsReply, Instruction, InstructionReply,
};
use common_meta::peer::Peer;
use common_procedure::{Context as ProcedureContext, Status};
use common_telemetry::info;
use futures::future::join_all;
use serde::{Deserialize, Serialize};
use snafu::{OptionExt, ResultExt, ensure};
use crate::error::{self, Error, Result};
use crate::handler::HeartbeatMailbox;
use crate::procedure::repartition::group::utils::{
HandleMultipleResult, group_region_routes_by_peer, handle_multiple_results,
};
use crate::procedure::repartition::group::{Context, GroupPrepareResult, State};
use crate::procedure::repartition::plan::RegionDescriptor;
use crate::service::mailbox::{Channel, MailboxRef};
#[derive(Debug, Serialize, Deserialize)]
pub struct EnterStagingRegion;
#[async_trait::async_trait]
#[typetag::serde]
impl State for EnterStagingRegion {
async fn next(
&mut self,
ctx: &mut Context,
_procedure_ctx: &ProcedureContext,
) -> Result<(Box<dyn State>, Status)> {
self.enter_staging_regions(ctx).await?;
Ok(Self::next_state())
}
fn as_any(&self) -> &dyn Any {
self
}
}
impl EnterStagingRegion {
#[allow(dead_code)]
fn next_state() -> (Box<dyn State>, Status) {
// TODO(weny): change it later.
(Box::new(EnterStagingRegion), Status::executing(true))
}
fn build_enter_staging_instructions(
prepare_result: &GroupPrepareResult,
targets: &[RegionDescriptor],
) -> Result<HashMap<Peer, Instruction>> {
let target_partition_expr_by_region = targets
.iter()
.map(|target| {
Ok((
target.region_id,
target
.partition_expr
.as_json_str()
.context(error::SerializePartitionExprSnafu)?,
))
})
.collect::<Result<HashMap<_, _>>>()?;
// Safety: `leader_peer` is set for all region routes, checked in `repartition_start`.
let target_region_routes_by_peer =
group_region_routes_by_peer(&prepare_result.target_routes);
let mut instructions = HashMap::with_capacity(target_region_routes_by_peer.len());
for (peer, region_ids) in target_region_routes_by_peer {
let enter_staging_regions = region_ids
.into_iter()
.map(|region_id| common_meta::instruction::EnterStagingRegion {
region_id,
// Safety: the target_routes is constructed from the targets, so the region_id is always present in the map.
partition_expr: target_partition_expr_by_region[&region_id].clone(),
})
.collect();
instructions.insert(
peer.clone(),
Instruction::EnterStagingRegions(enter_staging_regions),
);
}
Ok(instructions)
}
#[allow(dead_code)]
async fn enter_staging_regions(&self, ctx: &mut Context) -> Result<()> {
let table_id = ctx.persistent_ctx.table_id;
let group_id = ctx.persistent_ctx.group_id;
// Safety: the group prepare result is set in the RepartitionStart state.
let prepare_result = ctx.persistent_ctx.group_prepare_result.as_ref().unwrap();
let targets = &ctx.persistent_ctx.targets;
let instructions = Self::build_enter_staging_instructions(prepare_result, targets)?;
let operation_timeout =
ctx.next_operation_timeout()
.context(error::ExceededDeadlineSnafu {
operation: "Enter staging regions",
})?;
let (peers, tasks): (Vec<_>, Vec<_>) = instructions
.iter()
.map(|(peer, instruction)| {
(
peer,
Self::enter_staging_region(
&ctx.mailbox,
&ctx.server_addr,
peer,
instruction,
operation_timeout,
),
)
})
.unzip();
info!(
"Sent enter staging regions instructions to peers: {:?} for repartition table {}, group id {}",
peers, table_id, group_id
);
let format_err_msg = |idx: usize, error: &Error| {
let peer = peers[idx];
format!(
"Failed to enter staging regions on datanode {:?}, error: {:?}",
peer, error
)
};
// Waits for all tasks to complete.
let results = join_all(tasks).await;
let result = handle_multiple_results(&results);
match result {
HandleMultipleResult::AllSuccessful => Ok(()),
HandleMultipleResult::AllRetryable(retryable_errors) => error::RetryLaterSnafu {
reason: format!(
"All retryable errors during entering staging regions for repartition table {}, group id {}: {:?}",
table_id, group_id,
retryable_errors
.iter()
.map(|(idx, error)| format_err_msg(*idx, error))
.collect::<Vec<_>>()
.join(",")
),
}
.fail(),
HandleMultipleResult::AllNonRetryable(non_retryable_errors) => error::UnexpectedSnafu {
violated: format!(
"All non retryable errors during entering staging regions for repartition table {}, group id {}: {:?}",
table_id, group_id,
non_retryable_errors
.iter()
.map(|(idx, error)| format_err_msg(*idx, error))
.collect::<Vec<_>>()
.join(",")
),
}
.fail(),
HandleMultipleResult::PartialRetryable {
retryable_errors,
non_retryable_errors,
} => error::UnexpectedSnafu {
violated: format!(
"Partial retryable errors during entering staging regions for repartition table {}, group id {}: {:?}, non retryable errors: {:?}",
table_id, group_id,
retryable_errors
.iter()
.map(|(idx, error)| format_err_msg(*idx, error))
.collect::<Vec<_>>()
.join(","),
non_retryable_errors
.iter()
.map(|(idx, error)| format_err_msg(*idx, error))
.collect::<Vec<_>>()
.join(","),
),
}
.fail(),
}
}
/// Enter staging region on a datanode.
///
/// Retry:
/// - Pusher is not found.
/// - Mailbox timeout.
///
/// Abort(non-retry):
/// - Unexpected instruction reply.
/// - Exceeded deadline of enter staging regions instruction.
/// - Target region doesn't exist on the datanode.
async fn enter_staging_region(
mailbox: &MailboxRef,
server_addr: &str,
peer: &Peer,
instruction: &Instruction,
timeout: Duration,
) -> Result<()> {
let ch = Channel::Datanode(peer.id);
let message = MailboxMessage::json_message(
&format!("Enter staging regions: {:?}", instruction),
&format!("Metasrv@{}", server_addr),
&format!("Datanode-{}@{}", peer.id, peer.addr),
common_time::util::current_time_millis(),
&instruction,
)
.with_context(|_| error::SerializeToJsonSnafu {
input: instruction.to_string(),
})?;
let now = Instant::now();
let receiver = mailbox.send(&ch, message, timeout).await;
let receiver = match receiver {
Ok(receiver) => receiver,
Err(error::Error::PusherNotFound { .. }) => error::RetryLaterSnafu {
reason: format!(
"Pusher not found for enter staging regions on datanode {:?}, elapsed: {:?}",
peer,
now.elapsed()
),
}
.fail()?,
Err(err) => {
return Err(err);
}
};
match receiver.await {
Ok(msg) => {
let reply = HeartbeatMailbox::json_reply(&msg)?;
info!(
"Received enter staging regions reply: {:?}, elapsed: {:?}",
reply,
now.elapsed()
);
let InstructionReply::EnterStagingRegions(EnterStagingRegionsReply { replies }) =
reply
else {
return error::UnexpectedInstructionReplySnafu {
mailbox_message: msg.to_string(),
reason: "expect enter staging regions reply",
}
.fail();
};
for reply in replies {
Self::handle_enter_staging_region_reply(&reply, &now, peer)?;
}
Ok(())
}
Err(error::Error::MailboxTimeout { .. }) => {
let reason = format!(
"Mailbox received timeout for enter staging regions on datanode {:?}, elapsed: {:?}",
peer,
now.elapsed()
);
error::RetryLaterSnafu { reason }.fail()
}
Err(err) => Err(err),
}
}
fn handle_enter_staging_region_reply(
EnterStagingRegionReply {
region_id,
ready,
exists,
error,
}: &EnterStagingRegionReply,
now: &Instant,
peer: &Peer,
) -> Result<()> {
ensure!(
exists,
error::UnexpectedSnafu {
violated: format!(
"Region {} doesn't exist on datanode {:?}, elapsed: {:?}",
region_id,
peer,
now.elapsed()
)
}
);
if error.is_some() {
return error::RetryLaterSnafu {
reason: format!(
"Failed to enter staging region {} on datanode {:?}, error: {:?}, elapsed: {:?}",
region_id, peer, error, now.elapsed()
),
}
.fail();
}
ensure!(
ready,
error::RetryLaterSnafu {
reason: format!(
"Region {} is still entering staging state on datanode {:?}, elapsed: {:?}",
region_id,
peer,
now.elapsed()
),
}
);
Ok(())
}
}
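// NOTE (editorial sketch, not part of this change): a hypothetical helper that
// summarizes the decision order of `handle_enter_staging_region_reply`:
// a missing region aborts, a reported error or a not-yet-ready region retries.
fn classify_reply_sketch(reply: &EnterStagingRegionReply) -> &'static str {
    if !reply.exists {
        "abort: the region is missing on the datanode (non-retryable)"
    } else if reply.error.is_some() {
        "retry later: the datanode reported an error"
    } else if !reply.ready {
        "retry later: the region is still entering the staging state"
    } else {
        "ok"
    }
}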
#[cfg(test)]
mod tests {
use std::assert_matches::assert_matches;
use std::time::Duration;
use common_meta::instruction::Instruction;
use common_meta::peer::Peer;
use common_meta::rpc::router::{Region, RegionRoute};
use store_api::storage::RegionId;
use crate::error::{self, Error};
use crate::procedure::repartition::group::GroupPrepareResult;
use crate::procedure::repartition::group::enter_staging_region::EnterStagingRegion;
use crate::procedure::repartition::plan::RegionDescriptor;
use crate::procedure::repartition::test_util::{
TestingEnv, new_persistent_context, range_expr,
};
use crate::procedure::test_util::{
new_close_region_reply, new_enter_staging_region_reply, send_mock_reply,
};
use crate::service::mailbox::Channel;
#[test]
fn test_build_enter_staging_instructions() {
let table_id = 1024;
let prepare_result = GroupPrepareResult {
source_routes: vec![RegionRoute {
region: Region {
id: RegionId::new(table_id, 1),
..Default::default()
},
leader_peer: Some(Peer::empty(1)),
..Default::default()
}],
target_routes: vec![
RegionRoute {
region: Region {
id: RegionId::new(table_id, 1),
..Default::default()
},
leader_peer: Some(Peer::empty(1)),
..Default::default()
},
RegionRoute {
region: Region {
id: RegionId::new(table_id, 2),
..Default::default()
},
leader_peer: Some(Peer::empty(2)),
..Default::default()
},
],
central_region: RegionId::new(table_id, 1),
central_region_datanode_id: 1,
};
let targets = test_targets();
let instructions =
EnterStagingRegion::build_enter_staging_instructions(&prepare_result, &targets)
.unwrap();
assert_eq!(instructions.len(), 2);
let instruction_1 = instructions
.get(&Peer::empty(1))
.unwrap()
.clone()
.into_enter_staging_regions()
.unwrap();
assert_eq!(
instruction_1,
vec![common_meta::instruction::EnterStagingRegion {
region_id: RegionId::new(table_id, 1),
partition_expr: range_expr("x", 0, 10).as_json_str().unwrap(),
}]
);
let instruction_2 = instructions
.get(&Peer::empty(2))
.unwrap()
.clone()
.into_enter_staging_regions()
.unwrap();
assert_eq!(
instruction_2,
vec![common_meta::instruction::EnterStagingRegion {
region_id: RegionId::new(table_id, 2),
partition_expr: range_expr("x", 10, 20).as_json_str().unwrap(),
}]
);
}
#[tokio::test]
async fn test_datanode_is_unreachable() {
let env = TestingEnv::new();
let server_addr = "localhost";
let peer = Peer::empty(1);
let instruction =
Instruction::EnterStagingRegions(vec![common_meta::instruction::EnterStagingRegion {
region_id: RegionId::new(1024, 1),
partition_expr: range_expr("x", 0, 10).as_json_str().unwrap(),
}]);
let timeout = Duration::from_secs(10);
let err = EnterStagingRegion::enter_staging_region(
env.mailbox_ctx.mailbox(),
server_addr,
&peer,
&instruction,
timeout,
)
.await
.unwrap_err();
assert_matches!(err, Error::RetryLater { .. });
assert!(err.is_retryable());
}
#[tokio::test]
async fn test_enter_staging_region_exceeded_deadline() {
let mut env = TestingEnv::new();
let (tx, rx) = tokio::sync::mpsc::channel(1);
env.mailbox_ctx
.insert_heartbeat_response_receiver(Channel::Datanode(1), tx)
.await;
let server_addr = "localhost";
let peer = Peer::empty(1);
let instruction =
Instruction::EnterStagingRegions(vec![common_meta::instruction::EnterStagingRegion {
region_id: RegionId::new(1024, 1),
partition_expr: range_expr("x", 0, 10).as_json_str().unwrap(),
}]);
let timeout = Duration::from_secs(10);
// Sends a timeout error.
send_mock_reply(env.mailbox_ctx.mailbox().clone(), rx, |id| {
Err(error::MailboxTimeoutSnafu { id }.build())
});
let err = EnterStagingRegion::enter_staging_region(
env.mailbox_ctx.mailbox(),
server_addr,
&peer,
&instruction,
timeout,
)
.await
.unwrap_err();
assert_matches!(err, Error::RetryLater { .. });
assert!(err.is_retryable());
}
#[tokio::test]
async fn test_unexpected_instruction_reply() {
let mut env = TestingEnv::new();
let (tx, rx) = tokio::sync::mpsc::channel(1);
let server_addr = "localhost";
let peer = Peer::empty(1);
let instruction =
Instruction::EnterStagingRegions(vec![common_meta::instruction::EnterStagingRegion {
region_id: RegionId::new(1024, 1),
partition_expr: range_expr("x", 0, 10).as_json_str().unwrap(),
}]);
let timeout = Duration::from_secs(10);
env.mailbox_ctx
.insert_heartbeat_response_receiver(Channel::Datanode(1), tx)
.await;
// Sends an incorrect reply.
send_mock_reply(env.mailbox_ctx.mailbox().clone(), rx, |id| {
Ok(new_close_region_reply(id))
});
let err = EnterStagingRegion::enter_staging_region(
env.mailbox_ctx.mailbox(),
server_addr,
&peer,
&instruction,
timeout,
)
.await
.unwrap_err();
assert_matches!(err, Error::UnexpectedInstructionReply { .. });
assert!(!err.is_retryable());
}
#[tokio::test]
async fn test_enter_staging_region_failed_to_enter_staging_state() {
let mut env = TestingEnv::new();
let (tx, rx) = tokio::sync::mpsc::channel(1);
env.mailbox_ctx
.insert_heartbeat_response_receiver(Channel::Datanode(1), tx)
.await;
let server_addr = "localhost";
let peer = Peer::empty(1);
let instruction =
Instruction::EnterStagingRegions(vec![common_meta::instruction::EnterStagingRegion {
region_id: RegionId::new(1024, 1),
partition_expr: range_expr("x", 0, 10).as_json_str().unwrap(),
}]);
let timeout = Duration::from_secs(10);
// Sends a failed reply.
send_mock_reply(env.mailbox_ctx.mailbox().clone(), rx, |id| {
Ok(new_enter_staging_region_reply(
id,
RegionId::new(1024, 1),
false,
true,
Some("test mocked".to_string()),
))
});
let err = EnterStagingRegion::enter_staging_region(
env.mailbox_ctx.mailbox(),
server_addr,
&peer,
&instruction,
timeout,
)
.await
.unwrap_err();
assert_matches!(err, Error::RetryLater { .. });
assert!(err.is_retryable());
let (tx, rx) = tokio::sync::mpsc::channel(1);
env.mailbox_ctx
.insert_heartbeat_response_receiver(Channel::Datanode(1), tx)
.await;
// Region doesn't exist on the datanode.
send_mock_reply(env.mailbox_ctx.mailbox().clone(), rx, |id| {
Ok(new_enter_staging_region_reply(
id,
RegionId::new(1024, 1),
false,
false,
None,
))
});
let err = EnterStagingRegion::enter_staging_region(
env.mailbox_ctx.mailbox(),
server_addr,
&peer,
&instruction,
timeout,
)
.await
.unwrap_err();
assert_matches!(err, Error::Unexpected { .. });
assert!(!err.is_retryable());
}
fn test_prepare_result(table_id: u32) -> GroupPrepareResult {
GroupPrepareResult {
source_routes: vec![],
target_routes: vec![
RegionRoute {
region: Region {
id: RegionId::new(table_id, 1),
..Default::default()
},
leader_peer: Some(Peer::empty(1)),
..Default::default()
},
RegionRoute {
region: Region {
id: RegionId::new(table_id, 2),
..Default::default()
},
leader_peer: Some(Peer::empty(2)),
..Default::default()
},
],
central_region: RegionId::new(table_id, 1),
central_region_datanode_id: 1,
}
}
fn test_targets() -> Vec<RegionDescriptor> {
vec![
RegionDescriptor {
region_id: RegionId::new(1024, 1),
partition_expr: range_expr("x", 0, 10),
},
RegionDescriptor {
region_id: RegionId::new(1024, 2),
partition_expr: range_expr("x", 10, 20),
},
]
}
#[tokio::test]
async fn test_enter_staging_regions_all_successful() {
let mut env = TestingEnv::new();
let table_id = 1024;
let targets = test_targets();
let mut persistent_context = new_persistent_context(table_id, vec![], targets);
persistent_context.group_prepare_result = Some(test_prepare_result(table_id));
let (tx, rx) = tokio::sync::mpsc::channel(1);
env.mailbox_ctx
.insert_heartbeat_response_receiver(Channel::Datanode(1), tx)
.await;
send_mock_reply(env.mailbox_ctx.mailbox().clone(), rx, |id| {
Ok(new_enter_staging_region_reply(
id,
RegionId::new(1024, 1),
true,
true,
None,
))
});
let (tx, rx) = tokio::sync::mpsc::channel(1);
env.mailbox_ctx
.insert_heartbeat_response_receiver(Channel::Datanode(2), tx)
.await;
send_mock_reply(env.mailbox_ctx.mailbox().clone(), rx, |id| {
Ok(new_enter_staging_region_reply(
id,
RegionId::new(1024, 2),
true,
true,
None,
))
});
let mut ctx = env.create_context(persistent_context);
EnterStagingRegion
.enter_staging_regions(&mut ctx)
.await
.unwrap();
}
#[tokio::test]
async fn test_enter_staging_region_retryable() {
let env = TestingEnv::new();
let table_id = 1024;
let targets = test_targets();
let mut persistent_context = new_persistent_context(table_id, vec![], targets);
persistent_context.group_prepare_result = Some(test_prepare_result(table_id));
let mut ctx = env.create_context(persistent_context);
let err = EnterStagingRegion
.enter_staging_regions(&mut ctx)
.await
.unwrap_err();
assert_matches!(err, Error::RetryLater { .. });
assert!(err.is_retryable());
}
#[tokio::test]
async fn test_enter_staging_regions_non_retryable() {
let mut env = TestingEnv::new();
let table_id = 1024;
let targets = test_targets();
let mut persistent_context = new_persistent_context(table_id, vec![], targets);
persistent_context.group_prepare_result = Some(test_prepare_result(table_id));
let (tx, rx) = tokio::sync::mpsc::channel(1);
env.mailbox_ctx
.insert_heartbeat_response_receiver(Channel::Datanode(1), tx)
.await;
// Sends an incorrect reply.
send_mock_reply(env.mailbox_ctx.mailbox().clone(), rx, |id| {
Ok(new_close_region_reply(id))
});
let mut ctx = env.create_context(persistent_context.clone());
// Datanode 1 returns unexpected reply.
// Datanode 2 is unreachable.
let err = EnterStagingRegion
.enter_staging_regions(&mut ctx)
.await
.unwrap_err();
assert_matches!(err, Error::Unexpected { .. });
assert!(!err.is_retryable());
let (tx, rx) = tokio::sync::mpsc::channel(1);
env.mailbox_ctx
.insert_heartbeat_response_receiver(Channel::Datanode(2), tx)
.await;
// Sends an incorrect reply.
send_mock_reply(env.mailbox_ctx.mailbox().clone(), rx, |id| {
Ok(new_close_region_reply(id))
});
let mut ctx = env.create_context(persistent_context);
// Datanode 1 returns unexpected reply.
// Datanode 2 returns unexpected reply.
let err = EnterStagingRegion
.enter_staging_regions(&mut ctx)
.await
.unwrap_err();
assert_matches!(err, Error::Unexpected { .. });
assert!(!err.is_retryable());
}
}

View File

@@ -97,6 +97,17 @@ impl RepartitionStart {
.map(|r| (*r).clone())
})
.collect::<Result<Vec<_>>>()?;
for target_region_route in &target_region_routes {
ensure!(
target_region_route.leader_peer.is_some(),
error::UnexpectedSnafu {
violated: format!(
"Leader peer is not set for region: {}",
target_region_route.region.id
),
}
);
}
let central_region = sources[0].region_id;
let central_region_datanode_id = source_region_routes[0]
.leader_peer

View File

@@ -0,0 +1,88 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
use std::collections::HashMap;
use common_meta::peer::Peer;
use common_meta::rpc::router::RegionRoute;
use store_api::storage::RegionId;
use crate::error::{Error, Result};
/// Groups the region routes by the leader peer.
///
/// # Panics
///
/// Panics if the leader peer is not set for any of the region routes.
pub(crate) fn group_region_routes_by_peer(
region_routes: &[RegionRoute],
) -> HashMap<&Peer, Vec<RegionId>> {
let mut map: HashMap<&Peer, Vec<RegionId>> = HashMap::new();
for region_route in region_routes {
map.entry(region_route.leader_peer.as_ref().unwrap())
.or_default()
.push(region_route.region.id);
}
map
}
/// Returns `true` if all results are successful.
fn all_successful(results: &[Result<()>]) -> bool {
results.iter().all(Result::is_ok)
}
pub enum HandleMultipleResult<'a> {
AllSuccessful,
AllRetryable(Vec<(usize, &'a Error)>),
PartialRetryable {
retryable_errors: Vec<(usize, &'a Error)>,
non_retryable_errors: Vec<(usize, &'a Error)>,
},
AllNonRetryable(Vec<(usize, &'a Error)>),
}
/// Evaluates results from multiple operations and categorizes errors by retryability.
///
/// If all operations succeed, returns `AllSuccessful`.
/// If all errors are retryable, returns `AllRetryable`.
/// If all errors are non-retryable, returns `AllNonRetryable`.
/// Otherwise, returns `PartialRetryable` with separate collections for retryable and non-retryable errors.
pub(crate) fn handle_multiple_results<'a>(results: &'a [Result<()>]) -> HandleMultipleResult<'a> {
if all_successful(results) {
return HandleMultipleResult::AllSuccessful;
}
let mut retryable_errors = Vec::new();
let mut non_retryable_errors = Vec::new();
for (index, result) in results.iter().enumerate() {
if let Err(error) = result {
if error.is_retryable() {
retryable_errors.push((index, error));
} else {
non_retryable_errors.push((index, error));
}
}
}
match (retryable_errors.is_empty(), non_retryable_errors.is_empty()) {
(true, false) => HandleMultipleResult::AllNonRetryable(non_retryable_errors),
(false, true) => HandleMultipleResult::AllRetryable(retryable_errors),
(false, false) => HandleMultipleResult::PartialRetryable {
retryable_errors,
non_retryable_errors,
},
// Should not happen, but included for completeness.
(true, true) => HandleMultipleResult::AllSuccessful,
}
}
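// NOTE (editorial sketch, not part of this change): a minimal example of branching on
// the categorization; it mirrors the match in `EnterStagingRegion::enter_staging_regions`,
// and the returned strings are purely illustrative.
fn describe_outcome(results: &[Result<()>]) -> &'static str {
    match handle_multiple_results(results) {
        HandleMultipleResult::AllSuccessful => "all peers succeeded",
        // Only a uniformly retryable outcome is worth retrying the whole step.
        HandleMultipleResult::AllRetryable(_) => "retry the step later",
        // Any non-retryable failure aborts the step.
        HandleMultipleResult::AllNonRetryable(_)
        | HandleMultipleResult::PartialRetryable { .. } => "abort",
    }
}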

View File

@@ -16,11 +16,79 @@ use partition::expr::PartitionExpr;
use serde::{Deserialize, Serialize};
use store_api::storage::RegionId;
use crate::procedure::repartition::group::GroupId;
/// Metadata describing a region involved in the plan.
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)]
pub struct RegionDescriptor {
/// The region id of the region involved in the plan.
pub region_id: RegionId,
/// The new partition expression of the region.
/// The partition expression of the region.
pub partition_expr: PartitionExpr,
}
/// A plan entry for the region allocation phase, describing source regions
/// and target partition expressions before allocation.
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)]
pub struct AllocationPlanEntry {
/// The group id for this plan entry.
pub group_id: GroupId,
/// Source region descriptors involved in the plan.
pub source_regions: Vec<RegionDescriptor>,
/// The target partition expressions for the new or changed regions.
pub target_partition_exprs: Vec<PartitionExpr>,
/// The number of regions that need to be allocated (target count - source count, if positive).
pub regions_to_allocate: usize,
/// The number of regions that need to be deallocated (source count - target count, if positive).
pub regions_to_deallocate: usize,
}
/// A plan entry for the dispatch phase after region allocation,
/// with concrete source and target region descriptors.
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)]
pub struct RepartitionPlanEntry {
/// The group id for this plan entry.
pub group_id: GroupId,
/// The source region descriptors involved in the plan.
pub source_regions: Vec<RegionDescriptor>,
/// The target region descriptors involved in the plan.
pub target_regions: Vec<RegionDescriptor>,
/// The region ids of the allocated regions.
pub allocated_region_ids: Vec<RegionId>,
/// The region ids of the regions that are pending deallocation.
pub pending_deallocate_region_ids: Vec<RegionId>,
}
impl RepartitionPlanEntry {
/// Converts an allocation plan entry into a repartition plan entry.
///
/// The target regions are derived from the source regions and the target partition expressions.
/// The allocated region ids and pending deallocate region ids are empty.
pub fn from_allocation_plan_entry(
AllocationPlanEntry {
group_id,
source_regions,
target_partition_exprs,
regions_to_allocate,
regions_to_deallocate,
}: &AllocationPlanEntry,
) -> Self {
debug_assert!(*regions_to_allocate == 0 && *regions_to_deallocate == 0);
let target_regions = source_regions
.iter()
.zip(target_partition_exprs.iter())
.map(|(source_region, target_partition_expr)| RegionDescriptor {
region_id: source_region.region_id,
partition_expr: target_partition_expr.clone(),
})
.collect::<Vec<_>>();
Self {
group_id: *group_id,
source_regions: source_regions.clone(),
target_regions,
allocated_region_ids: vec![],
pending_deallocate_region_ids: vec![],
}
}
}
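// NOTE (editorial sketch, not part of this change): a test-style example of the
// conversion above; `range_expr` is the helper from `repartition::test_util` and the
// concrete ids are made up. With equal source/target counts, each target keeps its
// source's region id, takes the new partition expression, and both bookkeeping lists
// start empty.
fn conversion_sketch() {
    let entry = AllocationPlanEntry {
        group_id: GroupId::new_v4(),
        source_regions: vec![RegionDescriptor {
            region_id: RegionId::new(1024, 1),
            partition_expr: range_expr("x", 0, 10),
        }],
        target_partition_exprs: vec![range_expr("x", 0, 20)],
        regions_to_allocate: 0,
        regions_to_deallocate: 0,
    };
    let plan = RepartitionPlanEntry::from_allocation_plan_entry(&entry);
    assert_eq!(plan.target_regions[0].region_id, RegionId::new(1024, 1));
    assert!(plan.allocated_region_ids.is_empty());
    assert!(plan.pending_deallocate_region_ids.is_empty());
}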

View File

@@ -0,0 +1,40 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
use std::any::Any;
use common_procedure::{Context as ProcedureContext, Status};
use serde::{Deserialize, Serialize};
use crate::error::Result;
use crate::procedure::repartition::{Context, State};
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct RepartitionEnd;
#[async_trait::async_trait]
#[typetag::serde]
impl State for RepartitionEnd {
async fn next(
&mut self,
_ctx: &mut Context,
_procedure_ctx: &ProcedureContext,
) -> Result<(Box<dyn State>, Status)> {
Ok((Box::new(RepartitionEnd), Status::done()))
}
fn as_any(&self) -> &dyn Any {
self
}
}

View File

@@ -0,0 +1,172 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
use std::any::Any;
use common_meta::key::table_route::PhysicalTableRouteValue;
use common_procedure::{Context as ProcedureContext, Status};
use partition::expr::PartitionExpr;
use partition::subtask::{self, RepartitionSubtask};
use serde::{Deserialize, Serialize};
use snafu::{OptionExt, ResultExt};
use uuid::Uuid;
use crate::error::{self, Result};
use crate::procedure::repartition::allocate_region::AllocateRegion;
use crate::procedure::repartition::plan::{AllocationPlanEntry, RegionDescriptor};
use crate::procedure::repartition::repartition_end::RepartitionEnd;
use crate::procedure::repartition::{Context, State};
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct RepartitionStart {
from_exprs: Vec<PartitionExpr>,
to_exprs: Vec<PartitionExpr>,
}
impl RepartitionStart {
pub fn new(from_exprs: Vec<PartitionExpr>, to_exprs: Vec<PartitionExpr>) -> Self {
Self {
from_exprs,
to_exprs,
}
}
}
#[async_trait::async_trait]
#[typetag::serde]
impl State for RepartitionStart {
async fn next(
&mut self,
ctx: &mut Context,
_: &ProcedureContext,
) -> Result<(Box<dyn State>, Status)> {
let (_, table_route) = ctx
.table_metadata_manager
.table_route_manager()
.get_physical_table_route(ctx.persistent_ctx.table_id)
.await
.context(error::TableMetadataManagerSnafu)?;
let plans = Self::build_plan(&table_route, &self.from_exprs, &self.to_exprs)?;
if plans.is_empty() {
return Ok((Box::new(RepartitionEnd), Status::done()));
}
Ok((
Box::new(AllocateRegion::new(plans)),
Status::executing(false),
))
}
fn as_any(&self) -> &dyn Any {
self
}
}
impl RepartitionStart {
#[allow(dead_code)]
fn build_plan(
physical_route: &PhysicalTableRouteValue,
from_exprs: &[PartitionExpr],
to_exprs: &[PartitionExpr],
) -> Result<Vec<AllocationPlanEntry>> {
let subtasks = subtask::create_subtasks(from_exprs, to_exprs)
.context(error::RepartitionCreateSubtasksSnafu)?;
if subtasks.is_empty() {
return Ok(vec![]);
}
let src_descriptors = Self::source_region_descriptors(from_exprs, physical_route)?;
Ok(Self::build_plan_entries(
subtasks,
&src_descriptors,
to_exprs,
))
}
#[allow(dead_code)]
fn build_plan_entries(
subtasks: Vec<RepartitionSubtask>,
source_index: &[RegionDescriptor],
target_exprs: &[PartitionExpr],
) -> Vec<AllocationPlanEntry> {
subtasks
.into_iter()
.map(|subtask| {
let group_id = Uuid::new_v4();
let source_regions = subtask
.from_expr_indices
.iter()
.map(|&idx| source_index[idx].clone())
.collect::<Vec<_>>();
let target_partition_exprs = subtask
.to_expr_indices
.iter()
.map(|&idx| target_exprs[idx].clone())
.collect::<Vec<_>>();
let regions_to_allocate = target_partition_exprs
.len()
.saturating_sub(source_regions.len());
let regions_to_deallocate = source_regions
.len()
.saturating_sub(target_partition_exprs.len());
AllocationPlanEntry {
group_id,
source_regions,
target_partition_exprs,
regions_to_allocate,
regions_to_deallocate,
}
})
.collect::<Vec<_>>()
}
fn source_region_descriptors(
from_exprs: &[PartitionExpr],
physical_route: &PhysicalTableRouteValue,
) -> Result<Vec<RegionDescriptor>> {
let existing_regions = physical_route
.region_routes
.iter()
.map(|route| (route.region.id, route.region.partition_expr()))
.collect::<Vec<_>>();
let descriptors = from_exprs
.iter()
.map(|expr| {
let expr_json = expr
.as_json_str()
.context(error::SerializePartitionExprSnafu)?;
let matched_region_id = existing_regions
.iter()
.find_map(|(region_id, existing_expr)| {
(existing_expr == &expr_json).then_some(*region_id)
})
.with_context(|| error::RepartitionSourceExprMismatchSnafu {
expr: expr_json,
})?;
Ok(RegionDescriptor {
region_id: matched_region_id,
partition_expr: expr.clone(),
})
})
.collect::<Result<Vec<_>>>()?;
Ok(descriptors)
}
}
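// NOTE (editorial sketch, not part of this change): the allocate/deallocate counts in
// `build_plan_entries` are plain saturating subtraction between target and source counts.
#[test]
fn allocation_arithmetic_sketch() {
    // Split: 1 source expression becomes 3 target expressions -> allocate 2 regions.
    assert_eq!(3usize.saturating_sub(1), 2);
    assert_eq!(1usize.saturating_sub(3), 0);
    // Merge: 3 source expressions become 1 target expression -> deallocate 2 regions.
    assert_eq!(1usize.saturating_sub(3), 0);
    assert_eq!(3usize.saturating_sub(1), 2);
}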

View File

@@ -32,6 +32,7 @@ use crate::procedure::test_util::MailboxContext;
pub struct TestingEnv {
pub table_metadata_manager: TableMetadataManagerRef,
pub mailbox_ctx: MailboxContext,
pub server_addr: String,
}
impl Default for TestingEnv {
@@ -51,10 +52,11 @@ impl TestingEnv {
Self {
table_metadata_manager,
mailbox_ctx,
server_addr: "localhost".to_string(),
}
}
pub fn create_context(self, persistent_context: PersistentContext) -> Context {
pub fn create_context(&self, persistent_context: PersistentContext) -> Context {
let cache_invalidator = Arc::new(MetasrvCacheInvalidator::new(
self.mailbox_ctx.mailbox().clone(),
MetasrvInfo {
@@ -66,6 +68,8 @@ impl TestingEnv {
persistent_ctx: persistent_context,
table_metadata_manager: self.table_metadata_manager.clone(),
cache_invalidator,
mailbox: self.mailbox_ctx.mailbox().clone(),
server_addr: self.server_addr.clone(),
}
}
}
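// NOTE (editorial sketch, not part of this change): because `create_context` now
// borrows the environment, a single `TestingEnv` can hand out several contexts in one
// test, e.g. to re-run a state after swapping the mocked datanode replies.
fn two_contexts_sketch(env: &TestingEnv, first: PersistentContext, second: PersistentContext) {
    let _ctx_a = env.create_context(first);
    let _ctx_b = env.create_context(second);
}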

View File

@@ -17,8 +17,8 @@ use std::collections::HashMap;
use api::v1::meta::mailbox_message::Payload;
use api::v1::meta::{HeartbeatResponse, MailboxMessage};
use common_meta::instruction::{
DowngradeRegionReply, DowngradeRegionsReply, FlushRegionReply, InstructionReply, SimpleReply,
UpgradeRegionReply, UpgradeRegionsReply,
DowngradeRegionReply, DowngradeRegionsReply, EnterStagingRegionReply, EnterStagingRegionsReply,
FlushRegionReply, InstructionReply, SimpleReply, UpgradeRegionReply, UpgradeRegionsReply,
};
use common_meta::key::TableMetadataManagerRef;
use common_meta::key::table_route::TableRouteValue;
@@ -198,7 +198,7 @@ pub fn new_downgrade_region_reply(
}
}
/// Generates a [InstructionReply::UpgradeRegion] reply.
/// Generates a [InstructionReply::UpgradeRegions] reply.
pub fn new_upgrade_region_reply(
id: u64,
ready: bool,
@@ -225,6 +225,34 @@ pub fn new_upgrade_region_reply(
}
}
/// Generates a [InstructionReply::EnterStagingRegions] reply.
pub fn new_enter_staging_region_reply(
id: u64,
region_id: RegionId,
ready: bool,
exists: bool,
error: Option<String>,
) -> MailboxMessage {
MailboxMessage {
id,
subject: "mock".to_string(),
from: "datanode".to_string(),
to: "meta".to_string(),
timestamp_millis: current_time_millis(),
payload: Some(Payload::Json(
serde_json::to_string(&InstructionReply::EnterStagingRegions(
EnterStagingRegionsReply::new(vec![EnterStagingRegionReply {
region_id,
ready,
exists,
error,
}]),
))
.unwrap(),
)),
}
}
/// Mock the test data for WAL pruning.
pub async fn new_wal_prune_metadata(
table_metadata_manager: TableMetadataManagerRef,

View File

@@ -23,6 +23,7 @@ common-recordbatch.workspace = true
common-runtime.workspace = true
common-telemetry.workspace = true
common-time.workspace = true
chrono.workspace = true
datafusion.workspace = true
datatypes.workspace = true
futures-util.workspace = true

View File

@@ -43,10 +43,10 @@ pub(crate) use state::MetricEngineState;
use store_api::metadata::RegionMetadataRef;
use store_api::metric_engine_consts::METRIC_ENGINE_NAME;
use store_api::region_engine::{
BatchResponses, CopyRegionFromRequest, CopyRegionFromResponse, RegionEngine,
RegionManifestInfo, RegionRole, RegionScannerRef, RegionStatistic, RemapManifestsRequest,
RemapManifestsResponse, SetRegionRoleStateResponse, SetRegionRoleStateSuccess,
SettableRegionRoleState, SyncManifestResponse,
BatchResponses, RegionEngine, RegionRole, RegionScannerRef, RegionStatistic,
RemapManifestsRequest, RemapManifestsResponse, SetRegionRoleStateResponse,
SetRegionRoleStateSuccess, SettableRegionRoleState, SyncRegionFromRequest,
SyncRegionFromResponse,
};
use store_api::region_request::{
BatchRegionDdlRequest, RegionCatchupRequest, RegionOpenRequest, RegionRequest,
@@ -220,6 +220,13 @@ impl RegionEngine for MetricEngine {
UnsupportedRegionRequestSnafu { request }.fail()
}
}
RegionRequest::ApplyStagingManifest(_) => {
if self.inner.is_physical_region(region_id) {
return self.inner.mito.handle_request(region_id, request).await;
} else {
UnsupportedRegionRequestSnafu { request }.fail()
}
}
RegionRequest::Put(put) => self.inner.put_region(region_id, put).await,
RegionRequest::Create(create) => {
self.inner
@@ -354,12 +361,30 @@ impl RegionEngine for MetricEngine {
async fn sync_region(
&self,
region_id: RegionId,
manifest_info: RegionManifestInfo,
) -> Result<SyncManifestResponse, BoxedError> {
self.inner
.sync_region(region_id, manifest_info)
.await
.map_err(BoxedError::new)
request: SyncRegionFromRequest,
) -> Result<SyncRegionFromResponse, BoxedError> {
match request {
SyncRegionFromRequest::FromManifest(manifest_info) => self
.inner
.sync_region_from_manifest(region_id, manifest_info)
.await
.map_err(BoxedError::new),
SyncRegionFromRequest::FromRegion {
source_region_id,
parallelism,
} => {
if self.inner.is_physical_region(region_id) {
self.inner
.sync_region_from_region(region_id, source_region_id, parallelism)
.await
.map_err(BoxedError::new)
} else {
Err(BoxedError::new(
error::UnsupportedSyncRegionFromRequestSnafu { region_id }.build(),
))
}
}
}
}
async fn remap_manifests(
@@ -376,14 +401,6 @@ impl RegionEngine for MetricEngine {
}
}
async fn copy_region_from(
&self,
_region_id: RegionId,
_request: CopyRegionFromRequest,
) -> Result<CopyRegionFromResponse, BoxedError> {
todo!()
}
async fn set_region_role_state_gracefully(
&self,
region_id: RegionId,

View File

@@ -290,6 +290,11 @@ impl MetricEngineInner {
.metadata_region
.logical_regions(physical_region_id)
.await?;
common_telemetry::debug!(
"Recover states for physical region {}, logical regions: {:?}",
physical_region_id,
logical_regions
);
let physical_columns = self
.data_region
.physical_columns(physical_region_id)

View File

@@ -23,6 +23,7 @@ use store_api::metric_engine_consts::{
METRIC_ENGINE_INDEX_SKIPPING_INDEX_GRANULARITY_OPTION,
METRIC_ENGINE_INDEX_SKIPPING_INDEX_GRANULARITY_OPTION_DEFAULT, METRIC_ENGINE_INDEX_TYPE_OPTION,
};
use store_api::mito_engine_options::{COMPACTION_TYPE, COMPACTION_TYPE_TWCS, TWCS_TIME_WINDOW};
use crate::error::{Error, ParseRegionOptionsSnafu, Result};
@@ -32,6 +33,9 @@ use crate::error::{Error, ParseRegionOptionsSnafu, Result};
/// value and appropriately increasing the size of the index, it results in an improved indexing effect.
const SEG_ROW_COUNT_FOR_DATA_REGION: u32 = 256;
/// The default compaction time window for metric engine data regions.
const DEFAULT_DATA_REGION_COMPACTION_TIME_WINDOW: &str = "1d";
/// Physical region options.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct PhysicalRegionOptions {
@@ -72,6 +76,16 @@ pub fn set_data_region_options(
"sparse".to_string(),
);
}
if !options.contains_key(TWCS_TIME_WINDOW) {
options.insert(
COMPACTION_TYPE.to_string(),
COMPACTION_TYPE_TWCS.to_string(),
);
options.insert(
TWCS_TIME_WINDOW.to_string(),
DEFAULT_DATA_REGION_COMPACTION_TIME_WINDOW.to_string(),
);
}
}
impl TryFrom<&HashMap<String, String>> for PhysicalRegionOptions {
@@ -192,4 +206,29 @@ mod tests {
}
);
}
#[test]
fn test_set_data_region_options_default_compaction_time_window() {
// Test that default time window is set when not specified
let mut options = HashMap::new();
set_data_region_options(&mut options, false);
assert_eq!(
options.get(COMPACTION_TYPE),
Some(&COMPACTION_TYPE_TWCS.to_string())
);
assert_eq!(options.get(TWCS_TIME_WINDOW), Some(&"1d".to_string()));
}
#[test]
fn test_set_data_region_options_respects_user_compaction_time_window() {
// Test that user-specified time window is preserved
let mut options = HashMap::new();
options.insert(TWCS_TIME_WINDOW.to_string(), "2h".to_string());
options.insert(COMPACTION_TYPE.to_string(), "twcs".to_string());
set_data_region_options(&mut options, false);
// User's time window should be preserved
assert_eq!(options.get(TWCS_TIME_WINDOW), Some(&"2h".to_string()));
}
}

View File

@@ -12,242 +12,5 @@
// See the License for the specific language governing permissions and
// limitations under the License.
use std::time::Instant;
use common_telemetry::info;
use snafu::{OptionExt, ResultExt, ensure};
use store_api::region_engine::{RegionEngine, RegionManifestInfo, SyncManifestResponse};
use store_api::storage::RegionId;
use crate::engine::MetricEngineInner;
use crate::error::{
MetricManifestInfoSnafu, MitoSyncOperationSnafu, PhysicalRegionNotFoundSnafu, Result,
};
use crate::utils;
impl MetricEngineInner {
pub async fn sync_region(
&self,
region_id: RegionId,
manifest_info: RegionManifestInfo,
) -> Result<SyncManifestResponse> {
ensure!(
manifest_info.is_metric(),
MetricManifestInfoSnafu { region_id }
);
let metadata_region_id = utils::to_metadata_region_id(region_id);
// checked by ensure above
let metadata_manifest_version = manifest_info
.metadata_manifest_version()
.unwrap_or_default();
let metadata_flushed_entry_id = manifest_info
.metadata_flushed_entry_id()
.unwrap_or_default();
let metadata_region_manifest =
RegionManifestInfo::mito(metadata_manifest_version, metadata_flushed_entry_id, 0);
let metadata_synced = self
.mito
.sync_region(metadata_region_id, metadata_region_manifest)
.await
.context(MitoSyncOperationSnafu)?
.is_data_synced();
let data_region_id = utils::to_data_region_id(region_id);
let data_manifest_version = manifest_info.data_manifest_version();
let data_flushed_entry_id = manifest_info.data_flushed_entry_id();
let data_region_manifest =
RegionManifestInfo::mito(data_manifest_version, data_flushed_entry_id, 0);
let data_synced = self
.mito
.sync_region(data_region_id, data_region_manifest)
.await
.context(MitoSyncOperationSnafu)?
.is_data_synced();
if !metadata_synced {
return Ok(SyncManifestResponse::Metric {
metadata_synced,
data_synced,
new_opened_logical_region_ids: vec![],
});
}
let now = Instant::now();
// Recovers the states from the metadata region
// if the metadata manifest version is updated.
let physical_region_options = *self
.state
.read()
.unwrap()
.physical_region_states()
.get(&data_region_id)
.context(PhysicalRegionNotFoundSnafu {
region_id: data_region_id,
})?
.options();
let new_opened_logical_region_ids = self
.recover_states(data_region_id, physical_region_options)
.await?;
info!(
"Sync metadata region for physical region {}, cost: {:?}, new opened logical region ids: {:?}",
data_region_id,
now.elapsed(),
new_opened_logical_region_ids
);
Ok(SyncManifestResponse::Metric {
metadata_synced,
data_synced,
new_opened_logical_region_ids,
})
}
}
#[cfg(test)]
mod tests {
use std::collections::HashMap;
use api::v1::SemanticType;
use common_query::prelude::greptime_timestamp;
use common_telemetry::info;
use datatypes::data_type::ConcreteDataType;
use datatypes::schema::ColumnSchema;
use store_api::metadata::ColumnMetadata;
use store_api::region_engine::{RegionEngine, RegionManifestInfo};
use store_api::region_request::{
AddColumn, AlterKind, RegionAlterRequest, RegionFlushRequest, RegionRequest,
};
use store_api::storage::RegionId;
use crate::metadata_region::MetadataRegion;
use crate::test_util::TestEnv;
#[tokio::test]
async fn test_sync_region_with_new_created_logical_regions() {
common_telemetry::init_default_ut_logging();
let mut env = TestEnv::with_prefix("sync_with_new_created_logical_regions").await;
env.init_metric_region().await;
info!("creating follower engine");
// Create a follower engine.
let (_follower_mito, follower_metric) = env.create_follower_engine().await;
let physical_region_id = env.default_physical_region_id();
// Flushes the physical region
let metric_engine = env.metric();
metric_engine
.handle_request(
env.default_physical_region_id(),
RegionRequest::Flush(RegionFlushRequest::default()),
)
.await
.unwrap();
let response = follower_metric
.sync_region(physical_region_id, RegionManifestInfo::metric(1, 0, 1, 0))
.await
.unwrap();
assert!(response.is_metric());
let new_opened_logical_region_ids = response.new_opened_logical_region_ids().unwrap();
assert_eq!(new_opened_logical_region_ids, vec![RegionId::new(3, 2)]);
// Sync again, no new logical region should be opened
let response = follower_metric
.sync_region(physical_region_id, RegionManifestInfo::metric(1, 0, 1, 0))
.await
.unwrap();
assert!(response.is_metric());
let new_opened_logical_region_ids = response.new_opened_logical_region_ids().unwrap();
assert!(new_opened_logical_region_ids.is_empty());
}
fn test_alter_logical_region_request() -> RegionAlterRequest {
RegionAlterRequest {
kind: AlterKind::AddColumns {
columns: vec![AddColumn {
column_metadata: ColumnMetadata {
column_id: 0,
semantic_type: SemanticType::Tag,
column_schema: ColumnSchema::new(
"tag1",
ConcreteDataType::string_datatype(),
false,
),
},
location: None,
}],
},
}
}
#[tokio::test]
async fn test_sync_region_alter_alter_logical_region() {
common_telemetry::init_default_ut_logging();
let mut env = TestEnv::with_prefix("sync_region_alter_alter_logical_region").await;
env.init_metric_region().await;
info!("creating follower engine");
let physical_region_id = env.default_physical_region_id();
// Flushes the physical region
let metric_engine = env.metric();
metric_engine
.handle_request(
env.default_physical_region_id(),
RegionRequest::Flush(RegionFlushRequest::default()),
)
.await
.unwrap();
// Create a follower engine.
let (follower_mito, follower_metric) = env.create_follower_engine().await;
let metric_engine = env.metric();
let engine_inner = env.metric().inner;
let region_id = env.default_logical_region_id();
let request = test_alter_logical_region_request();
engine_inner
.alter_logical_regions(
physical_region_id,
vec![(region_id, request)],
&mut HashMap::new(),
)
.await
.unwrap();
// Flushes the physical region
metric_engine
.handle_request(
env.default_physical_region_id(),
RegionRequest::Flush(RegionFlushRequest::default()),
)
.await
.unwrap();
// Sync the follower engine
let response = follower_metric
.sync_region(physical_region_id, RegionManifestInfo::metric(2, 0, 2, 0))
.await
.unwrap();
assert!(response.is_metric());
let new_opened_logical_region_ids = response.new_opened_logical_region_ids().unwrap();
assert!(new_opened_logical_region_ids.is_empty());
let logical_region_id = env.default_logical_region_id();
let metadata_region = MetadataRegion::new(follower_mito.clone());
let semantic_type = metadata_region
.column_semantic_type(physical_region_id, logical_region_id, "tag1")
.await
.unwrap()
.unwrap();
assert_eq!(semantic_type, SemanticType::Tag);
let timestamp_index = metadata_region
.column_semantic_type(physical_region_id, logical_region_id, greptime_timestamp())
.await
.unwrap()
.unwrap();
assert_eq!(timestamp_index, SemanticType::Timestamp);
}
}
mod manifest;
mod region;

View File

@@ -0,0 +1,268 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
use std::time::Instant;
use common_telemetry::info;
use snafu::{OptionExt, ResultExt, ensure};
use store_api::region_engine::{RegionEngine, RegionManifestInfo, SyncRegionFromResponse};
use store_api::storage::RegionId;
use crate::engine::MetricEngineInner;
use crate::error::{
MetricManifestInfoSnafu, MitoSyncOperationSnafu, PhysicalRegionNotFoundSnafu, Result,
};
use crate::utils;
impl MetricEngineInner {
/// Syncs the region from the given manifest information (leader-follower scenario).
///
/// This operation:
/// 1. Syncs the metadata region manifest to the target version.
/// 2. Syncs the data region manifest to the target version.
/// 3. Recovers states and returns newly opened logical regions (if metadata was synced).
pub async fn sync_region_from_manifest(
&self,
region_id: RegionId,
manifest_info: RegionManifestInfo,
) -> Result<SyncRegionFromResponse> {
ensure!(
manifest_info.is_metric(),
MetricManifestInfoSnafu { region_id }
);
let metadata_region_id = utils::to_metadata_region_id(region_id);
// checked by ensure above
let metadata_manifest_version = manifest_info
.metadata_manifest_version()
.unwrap_or_default();
let metadata_flushed_entry_id = manifest_info
.metadata_flushed_entry_id()
.unwrap_or_default();
let metadata_region_manifest =
RegionManifestInfo::mito(metadata_manifest_version, metadata_flushed_entry_id, 0);
let metadata_synced = self
.mito
.sync_region(metadata_region_id, metadata_region_manifest.into())
.await
.context(MitoSyncOperationSnafu)?
.is_data_synced();
let data_region_id = utils::to_data_region_id(region_id);
let data_manifest_version = manifest_info.data_manifest_version();
let data_flushed_entry_id = manifest_info.data_flushed_entry_id();
let data_region_manifest =
RegionManifestInfo::mito(data_manifest_version, data_flushed_entry_id, 0);
let data_synced = self
.mito
.sync_region(data_region_id, data_region_manifest.into())
.await
.context(MitoSyncOperationSnafu)?
.is_data_synced();
if !metadata_synced {
return Ok(SyncRegionFromResponse::Metric {
metadata_synced,
data_synced,
new_opened_logical_region_ids: vec![],
});
}
let now = Instant::now();
// Recovers the states from the metadata region
// if the metadata manifest version is updated.
let physical_region_options = *self
.state
.read()
.unwrap()
.physical_region_states()
.get(&data_region_id)
.context(PhysicalRegionNotFoundSnafu {
region_id: data_region_id,
})?
.options();
let new_opened_logical_region_ids = self
.recover_states(data_region_id, physical_region_options)
.await?;
info!(
"Sync metadata region for physical region {}, cost: {:?}, new opened logical region ids: {:?}",
data_region_id,
now.elapsed(),
new_opened_logical_region_ids
);
Ok(SyncRegionFromResponse::Metric {
metadata_synced,
data_synced,
new_opened_logical_region_ids,
})
}
}
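// NOTE (editorial sketch, not part of this change): seen from the engine API, this path
// is reached through a `FromManifest` request, as in the dispatch arm of
// `MetricEngine::sync_region` and the tests below. Variable names are assumptions and
// `SyncRegionFromRequest` is assumed to be imported from `store_api::region_engine`.
async fn sync_from_manifest_sketch(
    follower_metric: &crate::engine::MetricEngine,
    physical_region_id: RegionId,
    manifest_info: RegionManifestInfo,
) {
    let response = follower_metric
        .sync_region(
            physical_region_id,
            SyncRegionFromRequest::FromManifest(manifest_info),
        )
        .await
        .unwrap();
    // Newly discovered logical regions, if the metadata manifest advanced.
    let _new_logical_regions = response.new_opened_logical_region_ids();
}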
#[cfg(test)]
mod tests {
use std::collections::HashMap;
use api::v1::SemanticType;
use common_query::prelude::greptime_timestamp;
use common_telemetry::info;
use datatypes::data_type::ConcreteDataType;
use datatypes::schema::ColumnSchema;
use store_api::metadata::ColumnMetadata;
use store_api::region_engine::{RegionEngine, RegionManifestInfo};
use store_api::region_request::{
AddColumn, AlterKind, RegionAlterRequest, RegionFlushRequest, RegionRequest,
};
use store_api::storage::RegionId;
use crate::metadata_region::MetadataRegion;
use crate::test_util::TestEnv;
#[tokio::test]
async fn test_sync_region_with_new_created_logical_regions() {
common_telemetry::init_default_ut_logging();
let mut env = TestEnv::with_prefix("sync_with_new_created_logical_regions").await;
env.init_metric_region().await;
info!("creating follower engine");
// Create a follower engine.
let (_follower_mito, follower_metric) = env.create_follower_engine().await;
let physical_region_id = env.default_physical_region_id();
// Flushes the physical region
let metric_engine = env.metric();
metric_engine
.handle_request(
env.default_physical_region_id(),
RegionRequest::Flush(RegionFlushRequest::default()),
)
.await
.unwrap();
let response = follower_metric
.sync_region(
physical_region_id,
RegionManifestInfo::metric(1, 0, 1, 0).into(),
)
.await
.unwrap();
assert!(response.is_metric());
let new_opened_logical_region_ids = response.new_opened_logical_region_ids().unwrap();
assert_eq!(new_opened_logical_region_ids, vec![RegionId::new(3, 2)]);
// Sync again, no new logical region should be opened
let response = follower_metric
.sync_region(
physical_region_id,
RegionManifestInfo::metric(1, 0, 1, 0).into(),
)
.await
.unwrap();
assert!(response.is_metric());
let new_opened_logical_region_ids = response.new_opened_logical_region_ids().unwrap();
assert!(new_opened_logical_region_ids.is_empty());
}
fn test_alter_logical_region_request() -> RegionAlterRequest {
RegionAlterRequest {
kind: AlterKind::AddColumns {
columns: vec![AddColumn {
column_metadata: ColumnMetadata {
column_id: 0,
semantic_type: SemanticType::Tag,
column_schema: ColumnSchema::new(
"tag1",
ConcreteDataType::string_datatype(),
false,
),
},
location: None,
}],
},
}
}
#[tokio::test]
async fn test_sync_region_alter_alter_logical_region() {
common_telemetry::init_default_ut_logging();
let mut env = TestEnv::with_prefix("sync_region_alter_alter_logical_region").await;
env.init_metric_region().await;
info!("creating follower engine");
let physical_region_id = env.default_physical_region_id();
// Flushes the physical region
let metric_engine = env.metric();
metric_engine
.handle_request(
env.default_physical_region_id(),
RegionRequest::Flush(RegionFlushRequest::default()),
)
.await
.unwrap();
// Create a follower engine.
let (follower_mito, follower_metric) = env.create_follower_engine().await;
let metric_engine = env.metric();
let engine_inner = env.metric().inner;
let region_id = env.default_logical_region_id();
let request = test_alter_logical_region_request();
engine_inner
.alter_logical_regions(
physical_region_id,
vec![(region_id, request)],
&mut HashMap::new(),
)
.await
.unwrap();
// Flushes the physical region
metric_engine
.handle_request(
env.default_physical_region_id(),
RegionRequest::Flush(RegionFlushRequest::default()),
)
.await
.unwrap();
// Sync the follower engine
let response = follower_metric
.sync_region(
physical_region_id,
RegionManifestInfo::metric(2, 0, 2, 0).into(),
)
.await
.unwrap();
assert!(response.is_metric());
let new_opened_logical_region_ids = response.new_opened_logical_region_ids().unwrap();
assert!(new_opened_logical_region_ids.is_empty());
let logical_region_id = env.default_logical_region_id();
let metadata_region = MetadataRegion::new(follower_mito.clone());
let semantic_type = metadata_region
.column_semantic_type(physical_region_id, logical_region_id, "tag1")
.await
.unwrap()
.unwrap();
assert_eq!(semantic_type, SemanticType::Tag);
let timestamp_index = metadata_region
.column_semantic_type(physical_region_id, logical_region_id, greptime_timestamp())
.await
.unwrap()
.unwrap();
assert_eq!(timestamp_index, SemanticType::Timestamp);
}
}

View File

@@ -0,0 +1,386 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
use std::time::Instant;
use common_error::ext::BoxedError;
use common_telemetry::info;
use mito2::manifest::action::RegionEdit;
use snafu::{OptionExt, ResultExt, ensure};
use store_api::region_engine::{MitoCopyRegionFromRequest, SyncRegionFromResponse};
use store_api::storage::RegionId;
use crate::engine::MetricEngineInner;
use crate::error::{
MissingFilesSnafu, MitoCopyRegionFromOperationSnafu, MitoEditRegionSnafu,
PhysicalRegionNotFoundSnafu, Result,
};
use crate::utils;
impl MetricEngineInner {
/// Syncs the logical regions from the source region to the target region in the metric engine.
///
/// This operation:
/// 1. Copies SST files from source metadata region to target metadata region
/// 2. Transforms logical region metadata (updates region numbers to match target)
/// 3. Edits target manifest to remove old file entries (copied files)
/// 4. Recovers states and returns newly opened logical region IDs
///
/// **Note**: Only the metadata region is synced. The data region is not affected.
pub(crate) async fn sync_region_from_region(
&self,
region_id: RegionId,
source_region_id: RegionId,
parallelism: usize,
) -> Result<SyncRegionFromResponse> {
let source_metadata_region_id = utils::to_metadata_region_id(source_region_id);
let target_metadata_region_id = utils::to_metadata_region_id(region_id);
let target_data_region_id = utils::to_data_region_id(region_id);
let source_data_region_id = utils::to_data_region_id(source_region_id);
info!(
"Syncing region from region {} to region {}, parallelism: {}",
source_region_id, region_id, parallelism
);
let res = self
.mito
.copy_region_from(
target_metadata_region_id,
MitoCopyRegionFromRequest {
source_region_id: source_metadata_region_id,
parallelism,
},
)
.await
.map_err(BoxedError::new)
.context(MitoCopyRegionFromOperationSnafu {
source_region_id: source_metadata_region_id,
target_region_id: target_metadata_region_id,
})?;
if res.copied_file_ids.is_empty() {
info!(
"No files were copied from source region {} to target region {}, copied file ids are empty",
source_metadata_region_id, target_metadata_region_id
);
return Ok(SyncRegionFromResponse::Metric {
metadata_synced: false,
data_synced: false,
new_opened_logical_region_ids: vec![],
});
}
let target_region = self.mito.find_region(target_metadata_region_id).context(
PhysicalRegionNotFoundSnafu {
region_id: target_metadata_region_id,
},
)?;
let files_to_remove = target_region.file_metas(&res.copied_file_ids).await;
let missing_file_ids = res
.copied_file_ids
.iter()
.zip(&files_to_remove)
.filter_map(|(file_id, maybe_meta)| {
if maybe_meta.is_none() {
Some(*file_id)
} else {
None
}
})
.collect::<Vec<_>>();
// `copy_region_from` does not trigger compaction,
// so there should be no files removed and thus no missing files.
ensure!(
missing_file_ids.is_empty(),
MissingFilesSnafu {
region_id: target_metadata_region_id,
file_ids: missing_file_ids,
}
);
let files_to_remove = files_to_remove.into_iter().flatten().collect::<Vec<_>>();
// Transform the logical region metadata of the target data region.
self.metadata_region
.transform_logical_region_metadata(target_data_region_id, source_data_region_id)
.await?;
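// Remove the copied metadata SST files from the target manifest; their entries have
// been re-inserted above with the target's region numbers.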
let edit = RegionEdit {
files_to_add: vec![],
files_to_remove: files_to_remove.clone(),
timestamp_ms: Some(chrono::Utc::now().timestamp_millis()),
compaction_time_window: None,
flushed_entry_id: None,
flushed_sequence: None,
committed_sequence: None,
};
self.mito
.edit_region(target_metadata_region_id, edit)
.await
.map_err(BoxedError::new)
.context(MitoEditRegionSnafu {
region_id: target_metadata_region_id,
})?;
info!(
"Successfully edit metadata region: {} after syncing from source metadata region: {}, files to remove: {:?}",
target_metadata_region_id,
source_metadata_region_id,
files_to_remove
.iter()
.map(|meta| meta.file_id)
.collect::<Vec<_>>(),
);
let now = Instant::now();
// Always recover states from the target metadata region after syncing
// from the source metadata region.
let physical_region_options = *self
.state
.read()
.unwrap()
.physical_region_states()
.get(&target_data_region_id)
.context(PhysicalRegionNotFoundSnafu {
region_id: target_data_region_id,
})?
.options();
let new_opened_logical_region_ids = self
.recover_states(target_data_region_id, physical_region_options)
.await?;
info!(
"Sync metadata region from source region {} to target region {}, recover states cost: {:?}, new opened logical region ids: {:?}",
source_metadata_region_id,
target_metadata_region_id,
now.elapsed(),
new_opened_logical_region_ids
);
Ok(SyncRegionFromResponse::Metric {
metadata_synced: true,
data_synced: false,
new_opened_logical_region_ids,
})
}
}
#[cfg(test)]
mod tests {
use common_error::ext::ErrorExt;
use common_error::status_code::StatusCode;
use common_telemetry::debug;
use store_api::metric_engine_consts::{METRIC_ENGINE_NAME, PHYSICAL_TABLE_METADATA_KEY};
use store_api::region_engine::{RegionEngine, SyncRegionFromRequest};
use store_api::region_request::{
BatchRegionDdlRequest, PathType, RegionCloseRequest, RegionFlushRequest, RegionOpenRequest,
RegionRequest,
};
use store_api::storage::RegionId;
use crate::metadata_region::MetadataRegion;
use crate::test_util::{TestEnv, create_logical_region_request};
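/// Asserts that the logical region exposes exactly `expected_columns` (compared after
/// sorting by name) in the given metadata region.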
async fn assert_logical_table_columns(
metadata_region: &MetadataRegion,
physical_region_id: RegionId,
logical_region_id: RegionId,
expected_columns: &[&str],
) {
let mut columns = metadata_region
.logical_columns(physical_region_id, logical_region_id)
.await
.unwrap()
.into_iter()
.map(|(n, _)| n)
.collect::<Vec<_>>();
columns.sort_unstable();
assert_eq!(columns, expected_columns);
}
#[tokio::test]
async fn test_sync_region_from_region() {
common_telemetry::init_default_ut_logging();
let env = TestEnv::new().await;
let metric_engine = env.metric();
let source_physical_region_id = RegionId::new(1024, 0);
let logical_region_id1 = RegionId::new(1025, 0);
let logical_region_id2 = RegionId::new(1026, 0);
env.create_physical_region(source_physical_region_id, "/test_dir1", vec![])
.await;
let region_create_request1 =
create_logical_region_request(&["job"], source_physical_region_id, "logical1");
let region_create_request2 =
create_logical_region_request(&["host"], source_physical_region_id, "logical2");
metric_engine
.handle_batch_ddl_requests(BatchRegionDdlRequest::Create(vec![
(logical_region_id1, region_create_request1),
(logical_region_id2, region_create_request2),
]))
.await
.unwrap();
debug!("Flushing source physical region");
metric_engine
.handle_request(
source_physical_region_id,
RegionRequest::Flush(RegionFlushRequest {
row_group_size: None,
}),
)
.await
.unwrap();
let logical_regions = metric_engine
.logical_regions(source_physical_region_id)
.await
.unwrap();
assert!(logical_regions.contains(&logical_region_id1));
assert!(logical_regions.contains(&logical_region_id2));
let target_physical_region_id = RegionId::new(1024, 1);
let target_logical_region_id1 = RegionId::new(1025, 1);
let target_logical_region_id2 = RegionId::new(1026, 1);
// Prepare target physical region
env.create_physical_region(target_physical_region_id, "/test_dir1", vec![])
.await;
let r = metric_engine
.sync_region(
target_physical_region_id,
SyncRegionFromRequest::FromRegion {
source_region_id: source_physical_region_id,
parallelism: 1,
},
)
.await
.unwrap();
let new_opened_logical_region_ids = r.new_opened_logical_region_ids().unwrap();
assert_eq!(new_opened_logical_region_ids.len(), 2);
assert!(new_opened_logical_region_ids.contains(&target_logical_region_id1));
assert!(new_opened_logical_region_ids.contains(&target_logical_region_id2));
debug!("Sync region from again");
assert_logical_table_columns(
&env.metadata_region(),
target_physical_region_id,
target_logical_region_id1,
&["greptime_timestamp", "greptime_value", "job"],
)
.await;
assert_logical_table_columns(
&env.metadata_region(),
target_physical_region_id,
target_logical_region_id2,
&["greptime_timestamp", "greptime_value", "host"],
)
.await;
let logical_regions = env
.metadata_region()
.logical_regions(target_physical_region_id)
.await
.unwrap();
assert_eq!(logical_regions.len(), 2);
assert!(logical_regions.contains(&target_logical_region_id1));
assert!(logical_regions.contains(&target_logical_region_id2));
// Should be ok to sync region from again.
let r = metric_engine
.sync_region(
target_physical_region_id,
SyncRegionFromRequest::FromRegion {
source_region_id: source_physical_region_id,
parallelism: 1,
},
)
.await
.unwrap();
let new_opened_logical_region_ids = r.new_opened_logical_region_ids().unwrap();
assert!(new_opened_logical_region_ids.is_empty());
// Try to close region and reopen it, should be ok.
metric_engine
.handle_request(
target_physical_region_id,
RegionRequest::Close(RegionCloseRequest {}),
)
.await
.unwrap();
let physical_region_option = [(PHYSICAL_TABLE_METADATA_KEY.to_string(), String::new())]
.into_iter()
.collect();
metric_engine
.handle_request(
target_physical_region_id,
RegionRequest::Open(RegionOpenRequest {
engine: METRIC_ENGINE_NAME.to_string(),
table_dir: "/test_dir1".to_string(),
path_type: PathType::Bare,
options: physical_region_option,
skip_wal_replay: false,
checkpoint: None,
}),
)
.await
.unwrap();
let logical_regions = env
.metadata_region()
.logical_regions(target_physical_region_id)
.await
.unwrap();
assert_eq!(logical_regions.len(), 2);
assert!(logical_regions.contains(&target_logical_region_id1));
assert!(logical_regions.contains(&target_logical_region_id2));
}
#[tokio::test]
async fn test_sync_region_from_region_with_no_files() {
common_telemetry::init_default_ut_logging();
let env = TestEnv::new().await;
let metric_engine = env.metric();
let source_physical_region_id = RegionId::new(1024, 0);
env.create_physical_region(source_physical_region_id, "/test_dir1", vec![])
.await;
let target_physical_region_id = RegionId::new(1024, 1);
env.create_physical_region(target_physical_region_id, "/test_dir1", vec![])
.await;
let r = metric_engine
.sync_region(
target_physical_region_id,
SyncRegionFromRequest::FromRegion {
source_region_id: source_physical_region_id,
parallelism: 1,
},
)
.await
.unwrap();
let new_opened_logical_region_ids = r.new_opened_logical_region_ids().unwrap();
assert!(new_opened_logical_region_ids.is_empty());
}
#[tokio::test]
async fn test_sync_region_from_region_source_not_exist() {
common_telemetry::init_default_ut_logging();
let env = TestEnv::new().await;
let metric_engine = env.metric();
let source_physical_region_id = RegionId::new(1024, 0);
let target_physical_region_id = RegionId::new(1024, 1);
env.create_physical_region(target_physical_region_id, "/test_dir1", vec![])
.await;
let err = metric_engine
.sync_region(
target_physical_region_id,
SyncRegionFromRequest::FromRegion {
source_region_id: source_physical_region_id,
parallelism: 1,
},
)
.await
.unwrap_err();
assert_eq!(err.status_code(), StatusCode::InvalidArguments);
}
}

View File

@@ -21,7 +21,7 @@ use common_macro::stack_trace_debug;
use datatypes::prelude::ConcreteDataType;
use snafu::{Location, Snafu};
use store_api::region_request::RegionRequest;
use store_api::storage::RegionId;
use store_api::storage::{FileId, RegionId};
#[derive(Snafu)]
#[snafu(visibility(pub))]
@@ -128,6 +128,27 @@ pub enum Error {
location: Location,
},
#[snafu(display(
"Mito copy region from operation fails, source region id: {}, target region id: {}",
source_region_id,
target_region_id
))]
MitoCopyRegionFromOperation {
source: BoxedError,
#[snafu(implicit)]
location: Location,
source_region_id: RegionId,
target_region_id: RegionId,
},
#[snafu(display("Mito edit region operation fails, region id: {}", region_id))]
MitoEditRegion {
region_id: RegionId,
source: BoxedError,
#[snafu(implicit)]
location: Location,
},
#[snafu(display("Failed to encode primary key"))]
EncodePrimaryKey {
source: mito_codec::error::Error,
@@ -256,6 +277,21 @@ pub enum Error {
location: Location,
},
#[snafu(display("Unsupported sync region from request for region {}", region_id))]
UnsupportedSyncRegionFromRequest {
region_id: RegionId,
#[snafu(implicit)]
location: Location,
},
#[snafu(display("Missing file metas in region {}, file ids: {:?}", region_id, file_ids))]
MissingFiles {
region_id: RegionId,
#[snafu(implicit)]
location: Location,
file_ids: Vec<FileId>,
},
#[snafu(display("Unsupported alter kind: {}", kind))]
UnsupportedAlterKind {
kind: String,
@@ -339,11 +375,12 @@ impl ErrorExt for Error {
| ParseRegionOptions { .. }
| UnexpectedRequest { .. }
| UnsupportedAlterKind { .. }
| UnsupportedRemapManifestsRequest { .. } => StatusCode::InvalidArguments,
| UnsupportedRemapManifestsRequest { .. }
| UnsupportedSyncRegionFromRequest { .. } => StatusCode::InvalidArguments,
ForbiddenPhysicalAlter { .. } | UnsupportedRegionRequest { .. } => {
StatusCode::Unsupported
}
ForbiddenPhysicalAlter { .. }
| UnsupportedRegionRequest { .. }
| MissingFiles { .. } => StatusCode::Unsupported,
DeserializeColumnMetadata { .. }
| SerializeColumnMetadata { .. }
@@ -369,7 +406,9 @@ impl ErrorExt for Error {
| MitoSyncOperation { source, .. }
| MitoEnterStagingOperation { source, .. }
| BatchOpenMitoRegion { source, .. }
| BatchCatchupMitoRegion { source, .. } => source.status_code(),
| BatchCatchupMitoRegion { source, .. }
| MitoCopyRegionFromOperation { source, .. }
| MitoEditRegion { source, .. } => source.status_code(),
EncodePrimaryKey { source, .. } => source.status_code(),

View File

@@ -25,6 +25,7 @@ use base64::Engine;
use base64::engine::general_purpose::STANDARD_NO_PAD;
use common_base::readable_size::ReadableSize;
use common_recordbatch::{RecordBatch, SendableRecordBatchStream};
use common_telemetry::{debug, info, warn};
use datafusion::prelude::{col, lit};
use futures_util::TryStreamExt;
use futures_util::stream::BoxStream;
@@ -400,14 +401,11 @@ impl MetadataRegion {
.await
.context(CacheGetSnafu)?;
let range = region_metadata.key_values.range(prefix.to_string()..);
let mut result = HashMap::new();
for (k, v) in range {
if !k.starts_with(prefix) {
break;
}
result.insert(k.clone(), v.clone());
}
get_all_with_prefix(&region_metadata, prefix, |k, v| {
result.insert(k.to_string(), v.to_string());
Ok(())
})?;
Ok(result)
}
@@ -558,6 +556,109 @@ impl MetadataRegion {
Ok(())
}
/// Updates logical region metadata so that any entries previously referencing
/// `source_region_id` are modified to reference the data region of `physical_region_id`.
///
/// This method should be called after copying files from `source_region_id`
/// into the target region. It scans the metadata for the target physical
/// region, finds logical regions with the same region number as the source,
/// and reinserts region and column entries updated to use the target's
/// region number.
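///
/// For example, after copying from a source physical region `(1024, 0)` into a target
/// whose data region is `(1024, 1)`, entries for a logical region `(1025, 0)` are
/// rewritten to reference `(1025, 1)`: the table id is kept and only the region number
/// is replaced with the target data region's number (ids chosen purely for illustration).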
pub async fn transform_logical_region_metadata(
&self,
physical_region_id: RegionId,
source_region_id: RegionId,
) -> Result<()> {
let metadata_region_id = utils::to_metadata_region_id(physical_region_id);
let data_region_id = utils::to_data_region_id(physical_region_id);
let logical_regions = self
.logical_regions(data_region_id)
.await?
.into_iter()
.filter(|r| r.region_number() == source_region_id.region_number())
.collect::<Vec<_>>();
if logical_regions.is_empty() {
info!(
"No logical regions found from source region {}, physical region id: {}",
source_region_id, physical_region_id,
);
return Ok(());
}
let metadata = self.load_all(metadata_region_id).await?;
let mut output = Vec::new();
for logical_region_id in &logical_regions {
let prefix = MetadataRegion::concat_column_key_prefix(*logical_region_id);
get_all_with_prefix(&metadata, &prefix, |k, v| {
// Safety: we have checked the prefix
let (src_logical_region_id, column_name) = Self::parse_column_key(k)?.unwrap();
// Change the region number to the data region number.
let new_key = MetadataRegion::concat_column_key(
RegionId::new(
src_logical_region_id.table_id(),
data_region_id.region_number(),
),
&column_name,
);
output.push((new_key, v.to_string()));
Ok(())
})?;
let new_key = MetadataRegion::concat_region_key(RegionId::new(
logical_region_id.table_id(),
data_region_id.region_number(),
));
output.push((new_key, String::new()));
}
if output.is_empty() {
warn!(
"No logical regions metadata found from source region {}, physical region id: {}",
source_region_id, physical_region_id
);
return Ok(());
}
debug!(
"Transform logical regions metadata to physical region {}, source region: {}, transformed metadata: {}",
data_region_id,
source_region_id,
output.len(),
);
let put_request = MetadataRegion::build_put_request_from_iter(output.into_iter());
self.mito
.handle_request(
metadata_region_id,
store_api::region_request::RegionRequest::Put(put_request),
)
.await
.context(MitoWriteOperationSnafu)?;
info!(
"Transformed {} logical regions metadata to physical region {}, source region: {}",
logical_regions.len(),
data_region_id,
source_region_id
);
self.cache.invalidate(&metadata_region_id).await;
Ok(())
}
}
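/// Invokes `callback` for every key-value entry in `region_metadata` whose key starts
/// with `prefix`, stopping at the first key outside the prefix range.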
fn get_all_with_prefix(
region_metadata: &RegionMetadataCacheEntry,
prefix: &str,
mut callback: impl FnMut(&str, &str) -> Result<()>,
) -> Result<()> {
let range = region_metadata.key_values.range(prefix.to_string()..);
for (k, v) in range {
if !k.starts_with(prefix) {
break;
}
callback(k, v)?;
}
Ok(())
}
#[cfg(test)]

View File

@@ -727,7 +727,7 @@ impl fmt::Display for FileType {
impl FileType {
/// Parses the file type from string.
fn parse(s: &str) -> Option<FileType> {
pub(crate) fn parse(s: &str) -> Option<FileType> {
match s {
"parquet" => Some(FileType::Parquet),
"puffin" => Some(FileType::Puffin(0)),

View File

@@ -62,7 +62,7 @@ use crate::read::projection::ProjectionMapper;
use crate::read::scan_region::{PredicateGroup, ScanInput};
use crate::read::seq_scan::SeqScan;
use crate::read::{BoxedBatchReader, BoxedRecordBatchStream};
use crate::region::options::MergeMode;
use crate::region::options::{MergeMode, RegionOptions};
use crate::region::version::VersionControlRef;
use crate::region::{ManifestContextRef, RegionLeaderState, RegionRoleState};
use crate::request::{OptionOutputTx, OutputTx, WorkerRequestWithTime};
@@ -311,9 +311,24 @@ impl CompactionScheduler {
request: CompactionRequest,
options: compact_request::Options,
) -> Result<()> {
let region_id = request.region_id();
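// Resolve compaction options and TTL (region-level overrides first, then db-level
// schema options), falling back to the region's own options on error.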
let (dynamic_compaction_opts, ttl) = find_dynamic_options(
region_id.table_id(),
&request.current_version.options,
&request.schema_metadata_manager,
)
.await
.unwrap_or_else(|e| {
warn!(e; "Failed to find dynamic options for region: {}", region_id);
(
request.current_version.options.compaction.clone(),
request.current_version.options.ttl.unwrap_or_default(),
)
});
let picker = new_picker(
&options,
&request.current_version.options.compaction,
&dynamic_compaction_opts,
request.current_version.options.append_mode,
Some(self.engine_config.max_background_compactions),
);
@@ -328,21 +343,10 @@ impl CompactionScheduler {
cache_manager,
manifest_ctx,
listener,
schema_metadata_manager,
schema_metadata_manager: _,
max_parallelism,
} = request;
let ttl = find_ttl(
region_id.table_id(),
current_version.options.ttl,
&schema_metadata_manager,
)
.await
.unwrap_or_else(|e| {
warn!(e; "Failed to get ttl for region: {}", region_id);
TimeToLive::default()
});
debug!(
"Pick compaction strategy {:?} for region: {}, ttl: {:?}",
picker, region_id, ttl
@@ -351,7 +355,10 @@ impl CompactionScheduler {
let compaction_region = CompactionRegion {
region_id,
current_version: current_version.clone(),
region_options: current_version.options.clone(),
region_options: RegionOptions {
compaction: dynamic_compaction_opts.clone(),
..current_version.options.clone()
},
engine_config: engine_config.clone(),
region_metadata: current_version.metadata.clone(),
cache_manager: cache_manager.clone(),
@@ -382,7 +389,7 @@ impl CompactionScheduler {
// If specified to run compaction remotely, we schedule the compaction job remotely.
// It will fall back to local compaction if there is no remote job scheduler.
let waiters = if current_version.options.compaction.remote_compaction() {
let waiters = if dynamic_compaction_opts.remote_compaction() {
if let Some(remote_job_scheduler) = &self.plugins.get::<RemoteJobSchedulerRef>() {
let remote_compaction_job = CompactionJob {
compaction_region: compaction_region.clone(),
@@ -411,7 +418,7 @@ impl CompactionScheduler {
return Ok(());
}
Err(e) => {
if !current_version.options.compaction.fallback_to_local() {
if !dynamic_compaction_opts.fallback_to_local() {
error!(e; "Failed to schedule remote compaction job for region {}", region_id);
return RemoteCompactionSnafu {
region_id,
@@ -494,29 +501,88 @@ impl Drop for CompactionScheduler {
}
}
/// Finds TTL of table by first examine table options then database options.
async fn find_ttl(
/// Finds compaction options and TTL together with a single metadata fetch to reduce RTT.
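///
/// Resolution order:
/// - If the region overrides its compaction options and has a TTL, both are taken from
///   the region options without fetching schema metadata.
/// - Otherwise the database-level schema options are fetched once. The TTL prefers the
///   region value, then the database value, then the default.
/// - When the region does not override compaction, `compaction.*` entries from the schema
///   options are parsed into compaction options; parsing failures or missing entries fall
///   back to the region's own compaction options.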
async fn find_dynamic_options(
table_id: TableId,
table_ttl: Option<TimeToLive>,
region_options: &crate::region::options::RegionOptions,
schema_metadata_manager: &SchemaMetadataManagerRef,
) -> Result<TimeToLive> {
// If table TTL is set, we use it.
if let Some(table_ttl) = table_ttl {
return Ok(table_ttl);
) -> Result<(crate::region::options::CompactionOptions, TimeToLive)> {
if region_options.compaction_override && region_options.ttl.is_some() {
debug!(
"Use region options directly for table {}: compaction={:?}, ttl={:?}",
table_id, region_options.compaction, region_options.ttl
);
return Ok((
region_options.compaction.clone(),
region_options.ttl.unwrap(),
));
}
let ttl = tokio::time::timeout(
let db_options = tokio::time::timeout(
crate::config::FETCH_OPTION_TIMEOUT,
schema_metadata_manager.get_schema_options_by_table_id(table_id),
)
.await
.context(TimeoutSnafu)?
.context(GetSchemaMetadataSnafu)?
.and_then(|options| options.ttl)
.unwrap_or_default()
.into();
.context(GetSchemaMetadataSnafu)?;
Ok(ttl)
let ttl = if region_options.ttl.is_some() {
debug!(
"Use region TTL directly for table {}: ttl={:?}",
table_id, region_options.ttl
);
region_options.ttl.unwrap()
} else {
db_options
.as_ref()
.and_then(|options| options.ttl)
.unwrap_or_default()
.into()
};
let compaction = if !region_options.compaction_override {
if let Some(schema_opts) = db_options {
let map: HashMap<String, String> = schema_opts
.extra_options
.iter()
.filter_map(|(k, v)| {
if k.starts_with("compaction.") {
Some((k.clone(), v.clone()))
} else {
None
}
})
.collect();
if map.is_empty() {
region_options.compaction.clone()
} else {
crate::region::options::RegionOptions::try_from(&map)
.map(|o| o.compaction)
.unwrap_or_else(|e| {
error!(e; "Failed to create RegionOptions from map");
region_options.compaction.clone()
})
}
} else {
debug!(
"DB options is None for table {}, use region compaction: compaction={:?}",
table_id, region_options.compaction
);
region_options.compaction.clone()
}
} else {
debug!(
"No schema options for table {}, use region compaction: compaction={:?}",
table_id, region_options.compaction
);
region_options.compaction.clone()
};
debug!(
"Resolved dynamic options for table {}: compaction={:?}, ttl={:?}",
table_id, compaction, ttl
);
Ok((compaction, ttl))
}
/// Status of running and pending region compaction tasks.
@@ -805,8 +871,12 @@ struct PendingCompaction {
#[cfg(test)]
mod tests {
use std::time::Duration;
use api::v1::region::StrictWindow;
use common_datasource::compression::CompressionType;
use common_meta::key::schema_name::SchemaNameValue;
use common_time::DatabaseTimeToLive;
use tokio::sync::{Barrier, oneshot};
use super::*;
@@ -818,6 +888,163 @@ mod tests {
use crate::test_util::scheduler_util::{SchedulerEnv, VecScheduler};
use crate::test_util::version_util::{VersionControlBuilder, apply_edit};
#[tokio::test]
async fn test_find_compaction_options_db_level() {
let env = SchedulerEnv::new().await;
let builder = VersionControlBuilder::new();
let (schema_metadata_manager, kv_backend) = mock_schema_metadata_manager();
let region_id = builder.region_id();
let table_id = region_id.table_id();
// Register table without ttl but with db-level compaction options
let mut schema_value = SchemaNameValue {
ttl: Some(DatabaseTimeToLive::default()),
..Default::default()
};
schema_value
.extra_options
.insert("compaction.type".to_string(), "twcs".to_string());
schema_value
.extra_options
.insert("compaction.twcs.time_window".to_string(), "2h".to_string());
schema_metadata_manager
.register_region_table_info(
table_id,
"t",
"c",
"s",
Some(schema_value),
kv_backend.clone(),
)
.await;
let version_control = Arc::new(builder.build());
let region_opts = version_control.current().version.options.clone();
let (opts, _) = find_dynamic_options(table_id, &region_opts, &schema_metadata_manager)
.await
.unwrap();
match opts {
crate::region::options::CompactionOptions::Twcs(t) => {
assert_eq!(t.time_window_seconds(), Some(2 * 3600));
}
}
let manifest_ctx = env
.mock_manifest_context(version_control.current().version.metadata.clone())
.await;
let (tx, _rx) = mpsc::channel(4);
let mut scheduler = env.mock_compaction_scheduler(tx);
let (otx, _orx) = oneshot::channel();
let request = scheduler
.region_status
.entry(region_id)
.or_insert_with(|| {
crate::compaction::CompactionStatus::new(
region_id,
version_control.clone(),
env.access_layer.clone(),
)
})
.new_compaction_request(
scheduler.request_sender.clone(),
OptionOutputTx::new(Some(OutputTx::new(otx))),
scheduler.engine_config.clone(),
scheduler.cache_manager.clone(),
&manifest_ctx,
scheduler.listener.clone(),
schema_metadata_manager.clone(),
1,
);
scheduler
.schedule_compaction_request(
request,
compact_request::Options::Regular(Default::default()),
)
.await
.unwrap();
}
#[tokio::test]
async fn test_find_compaction_options_priority() {
fn schema_value_with_twcs(time_window: &str) -> SchemaNameValue {
let mut schema_value = SchemaNameValue {
ttl: Some(DatabaseTimeToLive::default()),
..Default::default()
};
schema_value
.extra_options
.insert("compaction.type".to_string(), "twcs".to_string());
schema_value.extra_options.insert(
"compaction.twcs.time_window".to_string(),
time_window.to_string(),
);
schema_value
}
let cases = [
(
"db options set and table override set",
Some(schema_value_with_twcs("2h")),
true,
Some(Duration::from_secs(5 * 3600)),
Some(5 * 3600),
),
(
"db options set and table override not set",
Some(schema_value_with_twcs("2h")),
false,
None,
Some(2 * 3600),
),
(
"db options not set and table override set",
None,
true,
Some(Duration::from_secs(4 * 3600)),
Some(4 * 3600),
),
(
"db options not set and table override not set",
None,
false,
None,
None,
),
];
for (case_name, schema_value, override_set, table_window, expected_window) in cases {
let builder = VersionControlBuilder::new();
let (schema_metadata_manager, kv_backend) = mock_schema_metadata_manager();
let table_id = builder.region_id().table_id();
schema_metadata_manager
.register_region_table_info(
table_id,
"t",
"c",
"s",
schema_value,
kv_backend.clone(),
)
.await;
let version_control = Arc::new(builder.build());
let mut region_opts = version_control.current().version.options.clone();
region_opts.compaction_override = override_set;
if let Some(window) = table_window {
let crate::region::options::CompactionOptions::Twcs(twcs) =
&mut region_opts.compaction;
twcs.time_window = Some(window);
}
let (opts, _) = find_dynamic_options(table_id, &region_opts, &schema_metadata_manager)
.await
.unwrap();
match opts {
crate::region::options::CompactionOptions::Twcs(t) => {
assert_eq!(t.time_window_seconds(), expected_window, "{case_name}");
}
}
}
}
#[tokio::test]
async fn test_schedule_empty() {
let env = SchedulerEnv::new().await;

View File

@@ -35,7 +35,7 @@ use crate::access_layer::{
};
use crate::cache::{CacheManager, CacheManagerRef};
use crate::compaction::picker::{PickerOutput, new_picker};
use crate::compaction::{CompactionOutput, CompactionSstReaderBuilder, find_ttl};
use crate::compaction::{CompactionOutput, CompactionSstReaderBuilder, find_dynamic_options};
use crate::config::MitoConfig;
use crate::error::{
EmptyRegionDirSnafu, InvalidPartitionExprSnafu, JoinSnafu, ObjectStoreNotFoundSnafu, Result,
@@ -203,16 +203,22 @@ pub async fn open_compaction_region(
// Use the specified ttl.
Either::Left(ttl) => ttl,
// Get the ttl from the schema metadata manager.
Either::Right(schema_metadata_manager) => find_ttl(
req.region_id.table_id(),
current_version.options.ttl,
&schema_metadata_manager,
)
.await
.unwrap_or_else(|e| {
warn!(e; "Failed to get ttl for region: {}", region_metadata.region_id);
TimeToLive::default()
}),
Either::Right(schema_metadata_manager) => {
let (_, ttl) = find_dynamic_options(
req.region_id.table_id(),
&req.region_options,
&schema_metadata_manager,
)
.await
.unwrap_or_else(|e| {
warn!(e; "Failed to get ttl for region: {}", region_metadata.region_id);
(
crate::region::options::CompactionOptions::default(),
TimeToLive::default(),
)
});
ttl
}
};
Ok(CompactionRegion {

View File

@@ -162,6 +162,7 @@ impl CompactionTaskImpl {
edit,
result: Ok(()),
update_region_state: false,
is_staging: false,
}),
})
.await;

View File

@@ -244,6 +244,7 @@ mod tests {
options: RegionOptions {
ttl: ttl.map(|t| t.into()),
compaction: Default::default(),
compaction_override: false,
storage: None,
append_mode: false,
wal_options: Default::default(),

View File

@@ -76,6 +76,8 @@ mod copy_region_from_test;
#[cfg(test)]
mod remap_manifests_test;
#[cfg(test)]
mod apply_staging_manifest_test;
mod puffin_index;
use std::any::Any;
@@ -87,6 +89,7 @@ use api::region::RegionResponse;
use async_trait::async_trait;
use common_base::Plugins;
use common_error::ext::BoxedError;
use common_meta::error::UnexpectedSnafu;
use common_meta::key::SchemaMetadataManagerRef;
use common_recordbatch::{MemoryPermit, QueryMemoryTracker, SendableRecordBatchStream};
use common_stat::get_total_memory_bytes;
@@ -105,10 +108,10 @@ use store_api::metric_engine_consts::{
MANIFEST_INFO_EXTENSION_KEY, TABLE_COLUMN_METADATA_EXTENSION_KEY,
};
use store_api::region_engine::{
BatchResponses, CopyRegionFromRequest, CopyRegionFromResponse, MitoCopyRegionFromResponse,
RegionEngine, RegionManifestInfo, RegionRole, RegionScannerRef, RegionStatistic,
RemapManifestsRequest, RemapManifestsResponse, SetRegionRoleStateResponse,
SettableRegionRoleState, SyncManifestResponse,
BatchResponses, MitoCopyRegionFromRequest, MitoCopyRegionFromResponse, RegionEngine,
RegionManifestInfo, RegionRole, RegionScannerRef, RegionStatistic, RemapManifestsRequest,
RemapManifestsResponse, SetRegionRoleStateResponse, SettableRegionRoleState,
SyncRegionFromRequest, SyncRegionFromResponse,
};
use store_api::region_request::{
AffectedRows, RegionCatchupRequest, RegionOpenRequest, RegionRequest,
@@ -122,8 +125,8 @@ use crate::cache::{CacheManagerRef, CacheStrategy};
use crate::config::MitoConfig;
use crate::engine::puffin_index::{IndexEntryContext, collect_index_entries_from_puffin};
use crate::error::{
self, InvalidRequestSnafu, JoinSnafu, MitoManifestInfoSnafu, RecvSnafu, RegionNotFoundSnafu,
Result, SerdeJsonSnafu, SerializeColumnMetadataSnafu, SerializeManifestSnafu,
InvalidRequestSnafu, JoinSnafu, MitoManifestInfoSnafu, RecvSnafu, RegionNotFoundSnafu, Result,
SerdeJsonSnafu, SerializeColumnMetadataSnafu, SerializeManifestSnafu,
};
#[cfg(feature = "enterprise")]
use crate::extension::BoxedExtensionRangeProviderFactory;
@@ -395,7 +398,7 @@ impl MitoEngine {
}
/// Edit region's metadata by [RegionEdit] directly. Use with care.
/// Now we only allow adding files to region (the [RegionEdit] struct can only contain a non-empty "files_to_add" field).
/// Now we only allow adding files to or removing files from a region (the [RegionEdit] struct can only contain a non-empty "files_to_add" or "files_to_remove" field).
/// Other region editing intention will result in an "invalid request" error.
/// Also note that if a region is to be edited directly, we MUST not write data to it thereafter.
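///
/// # Example (illustrative sketch)
///
/// A removal-only edit, assuming `file_meta` describes a file already tracked by the
/// region's manifest (identifiers are placeholders):
///
/// ```ignore
/// let edit = RegionEdit {
///     files_to_add: vec![],
///     files_to_remove: vec![file_meta],
///     timestamp_ms: Some(chrono::Utc::now().timestamp_millis()),
///     compaction_time_window: None,
///     flushed_entry_id: None,
///     flushed_sequence: None,
///     committed_sequence: None,
/// };
/// engine.edit_region(region_id, edit).await?;
/// ```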
pub async fn edit_region(&self, region_id: RegionId, edit: RegionEdit) -> Result<()> {
@@ -430,7 +433,7 @@ impl MitoEngine {
pub async fn copy_region_from(
&self,
region_id: RegionId,
request: CopyRegionFromRequest,
request: MitoCopyRegionFromRequest,
) -> Result<MitoCopyRegionFromResponse> {
self.inner.copy_region_from(region_id, request).await
}
@@ -639,8 +642,7 @@ impl MitoEngine {
///
/// Only adding files to or removing files from a region is considered valid now.
fn is_valid_region_edit(edit: &RegionEdit) -> bool {
!edit.files_to_add.is_empty()
&& edit.files_to_remove.is_empty()
(!edit.files_to_add.is_empty() || !edit.files_to_remove.is_empty())
&& matches!(
edit,
RegionEdit {
@@ -1073,7 +1075,7 @@ impl EngineInner {
async fn copy_region_from(
&self,
region_id: RegionId,
request: CopyRegionFromRequest,
request: MitoCopyRegionFromRequest,
) -> Result<MitoCopyRegionFromResponse> {
let (request, receiver) =
WorkerRequest::try_from_copy_region_from_request(region_id, request)?;
@@ -1247,15 +1249,21 @@ impl RegionEngine for MitoEngine {
async fn sync_region(
&self,
region_id: RegionId,
manifest_info: RegionManifestInfo,
) -> Result<SyncManifestResponse, BoxedError> {
request: SyncRegionFromRequest,
) -> Result<SyncRegionFromResponse, BoxedError> {
let manifest_info = request
.into_region_manifest_info()
.context(UnexpectedSnafu {
err_msg: "Expected a manifest info request",
})
.map_err(BoxedError::new)?;
let (_, synced) = self
.inner
.sync_region(region_id, manifest_info)
.await
.map_err(BoxedError::new)?;
Ok(SyncManifestResponse::Mito { synced })
Ok(SyncRegionFromResponse::Mito { synced })
}
async fn remap_manifests(
@@ -1268,19 +1276,6 @@ impl RegionEngine for MitoEngine {
.map_err(BoxedError::new)
}
async fn copy_region_from(
&self,
_region_id: RegionId,
_request: CopyRegionFromRequest,
) -> Result<CopyRegionFromResponse, BoxedError> {
Err(BoxedError::new(
error::UnsupportedOperationSnafu {
err_msg: "copy_region_from is not supported",
}
.build(),
))
}
fn role(&self, region_id: RegionId) -> Option<RegionRole> {
self.inner.role(region_id)
}
@@ -1419,7 +1414,7 @@ mod tests {
};
assert!(is_valid_region_edit(&edit));
// Invalid: "files_to_add" is empty
// Invalid: "files_to_add" and "files_to_remove" are both empty
let edit = RegionEdit {
files_to_add: vec![],
files_to_remove: vec![],
@@ -1431,7 +1426,7 @@ mod tests {
};
assert!(!is_valid_region_edit(&edit));
// Invalid: "files_to_remove" is not empty
// Valid: "files_to_remove" is not empty
let edit = RegionEdit {
files_to_add: vec![FileMeta::default()],
files_to_remove: vec![FileMeta::default()],
@@ -1441,7 +1436,7 @@ mod tests {
flushed_sequence: None,
committed_sequence: None,
};
assert!(!is_valid_region_edit(&edit));
assert!(is_valid_region_edit(&edit));
// Invalid: other fields are not all "None"s
let edit = RegionEdit {

View File

@@ -0,0 +1,400 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
use std::assert_matches::assert_matches;
use std::fs;
use api::v1::Rows;
use datatypes::value::Value;
use partition::expr::{PartitionExpr, col};
use store_api::region_engine::{
RegionEngine, RegionRole, RemapManifestsRequest, SettableRegionRoleState,
};
use store_api::region_request::{
ApplyStagingManifestRequest, EnterStagingRequest, RegionFlushRequest, RegionRequest,
};
use store_api::storage::{FileId, RegionId};
use crate::config::MitoConfig;
use crate::error::Error;
use crate::manifest::action::RegionManifest;
use crate::sst::file::FileMeta;
use crate::test_util::{CreateRequestBuilder, TestEnv, build_rows, put_rows, rows_schema};
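/// Builds the partition expression `start <= col_name < end` used by these tests.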
fn range_expr(col_name: &str, start: i64, end: i64) -> PartitionExpr {
col(col_name)
.gt_eq(Value::Int64(start))
.and(col(col_name).lt(Value::Int64(end)))
}
#[tokio::test]
async fn test_apply_staging_manifest_invalid_region_state() {
common_telemetry::init_default_ut_logging();
test_apply_staging_manifest_invalid_region_state_with_format(false).await;
test_apply_staging_manifest_invalid_region_state_with_format(true).await;
}
async fn test_apply_staging_manifest_invalid_region_state_with_format(flat_format: bool) {
let mut env = TestEnv::with_prefix("invalid-region-state").await;
let engine = env
.create_engine(MitoConfig {
default_experimental_flat_format: flat_format,
..Default::default()
})
.await;
let region_id = RegionId::new(1, 1);
let request = CreateRequestBuilder::new()
.partition_expr_json(Some(range_expr("x", 0, 50).as_json_str().unwrap()))
.build();
engine
.handle_request(region_id, RegionRequest::Create(request))
.await
.unwrap();
// Region is in leader state, apply staging manifest request should fail.
let err = engine
.handle_request(
region_id,
RegionRequest::ApplyStagingManifest(ApplyStagingManifestRequest {
partition_expr: range_expr("x", 0, 100).as_json_str().unwrap(),
files_to_add: vec![],
}),
)
.await
.unwrap_err();
assert_matches!(
err.into_inner().as_any().downcast_ref::<Error>().unwrap(),
Error::RegionState { .. }
);
// Set the region to follower state; the apply staging manifest request should also fail.
engine
.set_region_role(region_id, RegionRole::Follower)
.unwrap();
let err = engine
.handle_request(
region_id,
RegionRequest::ApplyStagingManifest(ApplyStagingManifestRequest {
partition_expr: range_expr("x", 0, 100).as_json_str().unwrap(),
files_to_add: vec![],
}),
)
.await
.unwrap_err();
assert_matches!(
err.into_inner().as_any().downcast_ref::<Error>().unwrap(),
Error::RegionState { .. }
);
}
#[tokio::test]
async fn test_apply_staging_manifest_mismatched_partition_expr() {
common_telemetry::init_default_ut_logging();
test_apply_staging_manifest_mismatched_partition_expr_with_format(false).await;
test_apply_staging_manifest_mismatched_partition_expr_with_format(true).await;
}
async fn test_apply_staging_manifest_mismatched_partition_expr_with_format(flat_format: bool) {
let mut env = TestEnv::with_prefix("mismatched-partition-expr").await;
let engine = env
.create_engine(MitoConfig {
default_experimental_flat_format: flat_format,
..Default::default()
})
.await;
let region_id = RegionId::new(1, 1);
let request = CreateRequestBuilder::new().build();
engine
.handle_request(region_id, RegionRequest::Create(request))
.await
.unwrap();
engine
.handle_request(
region_id,
RegionRequest::EnterStaging(EnterStagingRequest {
partition_expr: range_expr("x", 0, 50).as_json_str().unwrap(),
}),
)
.await
.unwrap();
let err = engine
.handle_request(
region_id,
RegionRequest::ApplyStagingManifest(ApplyStagingManifestRequest {
partition_expr: range_expr("x", 0, 100).as_json_str().unwrap(),
files_to_add: vec![],
}),
)
.await
.unwrap_err();
assert_matches!(
err.into_inner().as_any().downcast_ref::<Error>().unwrap(),
Error::StagingPartitionExprMismatch { .. }
)
}
#[tokio::test]
async fn test_apply_staging_manifest_success() {
common_telemetry::init_default_ut_logging();
test_apply_staging_manifest_success_with_format(false).await;
test_apply_staging_manifest_success_with_format(true).await;
}
async fn test_apply_staging_manifest_success_with_format(flat_format: bool) {
let mut env = TestEnv::with_prefix("success").await;
let engine = env
.create_engine(MitoConfig {
default_experimental_flat_format: flat_format,
..Default::default()
})
.await;
let region_id = RegionId::new(1, 1);
let request = CreateRequestBuilder::new()
.partition_expr_json(Some(range_expr("tag_0", 0, 100).as_json_str().unwrap()))
.build();
let column_schemas = rows_schema(&request);
engine
.handle_request(region_id, RegionRequest::Create(request))
.await
.unwrap();
let new_region_id_1 = RegionId::new(1, 2);
let new_region_id_2 = RegionId::new(1, 3);
// Generate some data
for i in 0..3 {
let rows_data = Rows {
schema: column_schemas.clone(),
rows: build_rows(i * 10, (i + 1) * 10),
};
put_rows(&engine, region_id, rows_data).await;
engine
.handle_request(
region_id,
RegionRequest::Flush(RegionFlushRequest {
row_group_size: None,
}),
)
.await
.unwrap();
}
engine
.set_region_role_state_gracefully(region_id, SettableRegionRoleState::StagingLeader)
.await
.unwrap();
let result = engine
.remap_manifests(RemapManifestsRequest {
region_id,
input_regions: vec![region_id],
region_mapping: [(region_id, vec![new_region_id_1, new_region_id_2])]
.into_iter()
.collect(),
new_partition_exprs: [
(
new_region_id_1,
range_expr("tag_0", 0, 50).as_json_str().unwrap(),
),
(
new_region_id_2,
range_expr("tag_0", 50, 100).as_json_str().unwrap(),
),
]
.into_iter()
.collect(),
})
.await
.unwrap();
assert_eq!(result.new_manifests.len(), 2);
let new_manifest_1 =
serde_json::from_str::<RegionManifest>(&result.new_manifests[&new_region_id_1]).unwrap();
let new_manifest_2 =
serde_json::from_str::<RegionManifest>(&result.new_manifests[&new_region_id_2]).unwrap();
assert_eq!(new_manifest_1.files.len(), 3);
assert_eq!(new_manifest_2.files.len(), 3);
let request = CreateRequestBuilder::new().build();
engine
.handle_request(new_region_id_1, RegionRequest::Create(request))
.await
.unwrap();
engine
.handle_request(
new_region_id_1,
RegionRequest::EnterStaging(EnterStagingRequest {
partition_expr: range_expr("tag_0", 0, 50).as_json_str().unwrap(),
}),
)
.await
.unwrap();
let mut files_to_add = new_manifest_1.files.values().cloned().collect::<Vec<_>>();
// Before applying the staging manifest, the file list should be empty
let region = engine.get_region(new_region_id_1).unwrap();
let manifest = region.manifest_ctx.manifest().await;
assert_eq!(manifest.files.len(), 0);
let staging_manifest = region.manifest_ctx.staging_manifest().await.unwrap();
assert_eq!(staging_manifest.files.len(), 0);
engine
.handle_request(
new_region_id_1,
RegionRequest::ApplyStagingManifest(ApplyStagingManifestRequest {
partition_expr: range_expr("tag_0", 0, 50).as_json_str().unwrap(),
files_to_add: serde_json::to_vec(&files_to_add).unwrap(),
}),
)
.await
.unwrap();
// After applying the staging manifest, the files should match the new manifest
let region = engine.get_region(new_region_id_1).unwrap();
let manifest = region.manifest_ctx.manifest().await;
assert_eq!(manifest.files.len(), 3);
assert!(region.is_writable());
assert!(!region.is_staging());
// The manifest partition expr should be the same as the request.
assert_eq!(
manifest.metadata.partition_expr.as_ref().unwrap(),
&range_expr("tag_0", 0, 50).as_json_str().unwrap()
);
// The staging manifest should be cleared.
let staging_manifest = region.manifest_ctx.staging_manifest().await;
assert!(staging_manifest.is_none());
// The staging partition expr should be cleared.
assert!(region.staging_partition_expr.lock().unwrap().is_none());
// The staging manifest directory should be empty.
let data_home = env.data_home();
let region_dir = format!("{}/data/test/1_0000000001", data_home.display());
let staging_manifest_dir = format!("{}/staging/manifest", region_dir);
let staging_files = fs::read_dir(&staging_manifest_dir)
.map(|entries| entries.collect::<Result<Vec<_>, _>>().unwrap_or_default())
.unwrap_or_default();
assert_eq!(staging_files.len(), 0);
// Try to apply the staging manifest again with a modified file list.
files_to_add.push(FileMeta {
region_id,
file_id: FileId::random(),
..Default::default()
});
// This request will be ignored.
engine
.handle_request(
new_region_id_1,
RegionRequest::ApplyStagingManifest(ApplyStagingManifestRequest {
partition_expr: range_expr("tag_0", 0, 50).as_json_str().unwrap(),
files_to_add: serde_json::to_vec(&files_to_add).unwrap(),
}),
)
.await
.unwrap();
// The files number should not change.
let region = engine.get_region(new_region_id_1).unwrap();
let manifest = region.manifest_ctx.manifest().await;
assert_eq!(manifest.files.len(), 3);
}
#[tokio::test]
async fn test_apply_staging_manifest_invalid_files_to_add() {
common_telemetry::init_default_ut_logging();
test_apply_staging_manifest_invalid_files_to_add_with_format(false).await;
test_apply_staging_manifest_invalid_files_to_add_with_format(true).await;
}
async fn test_apply_staging_manifest_invalid_files_to_add_with_format(flat_format: bool) {
let mut env = TestEnv::with_prefix("invalid-files-to-add").await;
let engine = env
.create_engine(MitoConfig {
default_experimental_flat_format: flat_format,
..Default::default()
})
.await;
let region_id = RegionId::new(1, 1);
let request = CreateRequestBuilder::new().build();
engine
.handle_request(region_id, RegionRequest::Create(request))
.await
.unwrap();
engine
.handle_request(
region_id,
RegionRequest::EnterStaging(EnterStagingRequest {
partition_expr: range_expr("tag_0", 0, 50).as_json_str().unwrap(),
}),
)
.await
.unwrap();
let err = engine
.handle_request(
region_id,
RegionRequest::ApplyStagingManifest(ApplyStagingManifestRequest {
partition_expr: range_expr("tag_0", 0, 50).as_json_str().unwrap(),
files_to_add: b"invalid".to_vec(),
}),
)
.await
.unwrap_err();
assert_matches!(
err.into_inner().as_any().downcast_ref::<Error>().unwrap(),
Error::SerdeJson { .. }
);
}
#[tokio::test]
async fn test_apply_staging_manifest_empty_files() {
common_telemetry::init_default_ut_logging();
test_apply_staging_manifest_empty_files_with_format(false).await;
test_apply_staging_manifest_empty_files_with_format(true).await;
}
async fn test_apply_staging_manifest_empty_files_with_format(flat_format: bool) {
let mut env = TestEnv::with_prefix("empty-files").await;
let engine = env
.create_engine(MitoConfig {
default_experimental_flat_format: flat_format,
..Default::default()
})
.await;
let region_id = RegionId::new(1, 1);
let request = CreateRequestBuilder::new().build();
engine
.handle_request(region_id, RegionRequest::Create(request))
.await
.unwrap();
engine
.handle_request(
region_id,
RegionRequest::EnterStaging(EnterStagingRequest {
partition_expr: range_expr("tag_0", 0, 50).as_json_str().unwrap(),
}),
)
.await
.unwrap();
engine
.handle_request(
region_id,
RegionRequest::ApplyStagingManifest(ApplyStagingManifestRequest {
partition_expr: range_expr("tag_0", 0, 50).as_json_str().unwrap(),
files_to_add: serde_json::to_vec::<Vec<FileMeta>>(&vec![]).unwrap(),
}),
)
.await
.unwrap();
let region = engine.get_region(region_id).unwrap();
let manifest = region.manifest_ctx.manifest().await;
assert_eq!(manifest.files.len(), 0);
let staging_manifest = region.manifest_ctx.staging_manifest().await;
assert!(staging_manifest.is_none());
let staging_partition_expr = region.staging_partition_expr.lock().unwrap();
assert!(staging_partition_expr.is_none());
}

View File

@@ -20,7 +20,7 @@ use api::v1::Rows;
use common_error::ext::ErrorExt;
use common_error::status_code::StatusCode;
use object_store::layers::mock::{Error as MockError, ErrorKind, MockLayerBuilder};
use store_api::region_engine::{CopyRegionFromRequest, RegionEngine, RegionRole};
use store_api::region_engine::{MitoCopyRegionFromRequest, RegionEngine, RegionRole};
use store_api::region_request::{RegionFlushRequest, RegionRequest};
use store_api::storage::RegionId;
@@ -89,7 +89,7 @@ async fn test_engine_copy_region_from_with_format(flat_format: bool, with_index:
let resp = engine
.copy_region_from(
target_region_id,
CopyRegionFromRequest {
MitoCopyRegionFromRequest {
source_region_id,
parallelism: 1,
},
@@ -126,7 +126,7 @@ async fn test_engine_copy_region_from_with_format(flat_format: bool, with_index:
let resp2 = engine
.copy_region_from(
target_region_id,
CopyRegionFromRequest {
MitoCopyRegionFromRequest {
source_region_id,
parallelism: 1,
},
@@ -207,7 +207,7 @@ async fn test_engine_copy_region_failure_with_format(flat_format: bool) {
let err = engine
.copy_region_from(
target_region_id,
CopyRegionFromRequest {
MitoCopyRegionFromRequest {
source_region_id,
parallelism: 1,
},
@@ -225,7 +225,6 @@ async fn test_engine_copy_region_failure_with_format(flat_format: bool) {
let source_region_dir = format!("{}/data/test/1_0000000001", env.data_home().display());
assert_file_num_in_dir(&source_region_dir, 1);
assert_file_num_in_dir(&format!("{}/index", source_region_dir), 1);
assert_eq!(
source_region_files,
collect_filename_in_dir(&source_region_dir)
@@ -298,7 +297,7 @@ async fn test_engine_copy_region_invalid_args_with_format(flat_format: bool) {
let err = engine
.copy_region_from(
region_id,
CopyRegionFromRequest {
MitoCopyRegionFromRequest {
source_region_id: RegionId::new(2, 1),
parallelism: 1,
},
@@ -309,7 +308,7 @@ async fn test_engine_copy_region_invalid_args_with_format(flat_format: bool) {
let err = engine
.copy_region_from(
region_id,
CopyRegionFromRequest {
MitoCopyRegionFromRequest {
source_region_id: RegionId::new(1, 1),
parallelism: 1,
},
@@ -347,7 +346,7 @@ async fn test_engine_copy_region_unexpected_state_with_format(flat_format: bool)
let err = engine
.copy_region_from(
region_id,
CopyRegionFromRequest {
MitoCopyRegionFromRequest {
source_region_id: RegionId::new(1, 2),
parallelism: 1,
},

View File

@@ -23,11 +23,13 @@ use api::v1::Rows;
use common_error::ext::ErrorExt;
use common_error::status_code::StatusCode;
use common_recordbatch::RecordBatches;
use datatypes::value::Value;
use object_store::Buffer;
use object_store::layers::mock::{
Entry, Error as MockError, ErrorKind, List, Lister, Metadata, MockLayerBuilder,
Result as MockResult, Write, Writer,
};
use partition::expr::{PartitionExpr, col};
use store_api::region_engine::{RegionEngine, SettableRegionRoleState};
use store_api::region_request::{
EnterStagingRequest, RegionAlterRequest, RegionFlushRequest, RegionRequest,
@@ -38,10 +40,16 @@ use store_api::storage::{RegionId, ScanRequest};
use crate::config::MitoConfig;
use crate::engine::listener::NotifyEnterStagingResultListener;
use crate::error::Error;
use crate::region::{RegionLeaderState, RegionRoleState};
use crate::region::{RegionLeaderState, RegionRoleState, parse_partition_expr};
use crate::request::WorkerRequest;
use crate::test_util::{CreateRequestBuilder, TestEnv, build_rows, put_rows, rows_schema};
fn range_expr(col_name: &str, start: i64, end: i64) -> PartitionExpr {
col(col_name)
.gt_eq(Value::Int64(start))
.and(col(col_name).lt(Value::Int64(end)))
}
#[tokio::test]
async fn test_staging_state_integration() {
test_staging_state_integration_with_format(false).await;
@@ -227,7 +235,9 @@ async fn test_staging_state_validation_patterns() {
);
}
const PARTITION_EXPR: &str = "partition_expr";
fn default_partition_expr() -> String {
range_expr("a", 0, 100).as_json_str().unwrap()
}
#[tokio::test]
async fn test_staging_manifest_directory() {
@@ -237,6 +247,7 @@ async fn test_staging_manifest_directory() {
async fn test_staging_manifest_directory_with_format(flat_format: bool) {
common_telemetry::init_default_ut_logging();
let partition_expr = default_partition_expr();
let mut env = TestEnv::new().await;
let engine = env
.create_engine(MitoConfig {
@@ -274,14 +285,14 @@ async fn test_staging_manifest_directory_with_format(flat_format: bool) {
.handle_request(
region_id,
RegionRequest::EnterStaging(EnterStagingRequest {
partition_expr: PARTITION_EXPR.to_string(),
partition_expr: partition_expr.clone(),
}),
)
.await
.unwrap();
let region = engine.get_region(region_id).unwrap();
let staging_partition_expr = region.staging_partition_expr.lock().unwrap().clone();
assert_eq!(staging_partition_expr.unwrap(), PARTITION_EXPR);
assert_eq!(staging_partition_expr.unwrap(), partition_expr);
{
let manager = region.manifest_ctx.manifest_manager.read().await;
assert_eq!(
@@ -292,7 +303,7 @@ async fn test_staging_manifest_directory_with_format(flat_format: bool) {
.partition_expr
.as_deref()
.unwrap(),
PARTITION_EXPR
&partition_expr,
);
assert!(manager.manifest().metadata.partition_expr.is_none());
}
@@ -302,7 +313,7 @@ async fn test_staging_manifest_directory_with_format(flat_format: bool) {
.handle_request(
region_id,
RegionRequest::EnterStaging(EnterStagingRequest {
partition_expr: PARTITION_EXPR.to_string(),
partition_expr: partition_expr.clone(),
}),
)
.await
@@ -377,6 +388,7 @@ async fn test_staging_exit_success_with_manifests() {
async fn test_staging_exit_success_with_manifests_with_format(flat_format: bool) {
common_telemetry::init_default_ut_logging();
let partition_expr = default_partition_expr();
let mut env = TestEnv::new().await;
let engine = env
.create_engine(MitoConfig {
@@ -407,7 +419,7 @@ async fn test_staging_exit_success_with_manifests_with_format(flat_format: bool)
.handle_request(
region_id,
RegionRequest::EnterStaging(EnterStagingRequest {
partition_expr: PARTITION_EXPR.to_string(),
partition_expr: partition_expr.clone(),
}),
)
.await
@@ -465,6 +477,25 @@ async fn test_staging_exit_success_with_manifests_with_format(flat_format: bool)
"Staging manifest directory should contain 3 files before exit, got: {:?}",
staging_files_before
);
let region = engine.get_region(region_id).unwrap();
{
let manager = region.manifest_ctx.manifest_manager.read().await;
let staging_manifest = manager.staging_manifest().unwrap();
assert_eq!(staging_manifest.files.len(), 3);
assert_eq!(
staging_manifest.metadata.partition_expr.as_ref().unwrap(),
&partition_expr
);
let expr = parse_partition_expr(Some(partition_expr.as_str()))
.unwrap()
.unwrap();
for file in staging_manifest.files.values() {
let Some(file_expr) = file.partition_expr.as_ref() else {
continue;
};
assert_eq!(*file_expr, expr);
}
}
// Count normal manifest files before exit
let normal_manifest_dir = format!("{}/manifest", region_dir);
@@ -583,6 +614,7 @@ async fn test_write_stall_on_enter_staging() {
async fn test_write_stall_on_enter_staging_with_format(flat_format: bool) {
let mut env = TestEnv::new().await;
let partition_expr = default_partition_expr();
let listener = Arc::new(NotifyEnterStagingResultListener::default());
let engine = env
.create_engine_with(
@@ -622,7 +654,7 @@ async fn test_write_stall_on_enter_staging_with_format(flat_format: bool) {
.handle_request(
region_id,
RegionRequest::EnterStaging(EnterStagingRequest {
partition_expr: PARTITION_EXPR.to_string(),
partition_expr: partition_expr.clone(),
}),
)
.await
@@ -706,6 +738,7 @@ impl Write for MockWriter {
}
async fn test_enter_staging_error(env: &mut TestEnv, flat_format: bool) {
let partition_expr = default_partition_expr();
let engine = env
.create_engine(MitoConfig {
default_experimental_flat_format: flat_format,
@@ -723,7 +756,7 @@ async fn test_enter_staging_error(env: &mut TestEnv, flat_format: bool) {
.handle_request(
region_id,
RegionRequest::EnterStaging(EnterStagingRequest {
partition_expr: PARTITION_EXPR.to_string(),
partition_expr: partition_expr.clone(),
}),
)
.await

View File

@@ -153,7 +153,7 @@ async fn test_sync_after_flush_region_with_format(flat_format: bool) {
// Returns error since the max manifest is 1
let manifest_info = RegionManifestInfo::mito(2, 0, 0);
let err = follower_engine
.sync_region(region_id, manifest_info)
.sync_region(region_id, manifest_info.into())
.await
.unwrap_err();
let err = err.as_any().downcast_ref::<Error>().unwrap();
@@ -161,7 +161,7 @@ async fn test_sync_after_flush_region_with_format(flat_format: bool) {
let manifest_info = RegionManifestInfo::mito(1, 0, 0);
follower_engine
.sync_region(region_id, manifest_info)
.sync_region(region_id, manifest_info.into())
.await
.unwrap();
common_telemetry::info!("Scan the region on the follower engine after sync");
@@ -266,7 +266,7 @@ async fn test_sync_after_alter_region_with_format(flat_format: bool) {
// Sync the region from the leader engine to the follower engine
let manifest_info = RegionManifestInfo::mito(2, 0, 0);
follower_engine
.sync_region(region_id, manifest_info)
.sync_region(region_id, manifest_info.into())
.await
.unwrap();
let expected = "\

View File

@@ -26,7 +26,7 @@ use either::Either;
use partition::expr::PartitionExpr;
use smallvec::{SmallVec, smallvec};
use snafu::ResultExt;
use store_api::storage::RegionId;
use store_api::storage::{RegionId, SequenceNumber};
use strum::IntoStaticStr;
use tokio::sync::{Semaphore, mpsc, watch};
@@ -36,8 +36,8 @@ use crate::access_layer::{
use crate::cache::CacheManagerRef;
use crate::config::MitoConfig;
use crate::error::{
Error, FlushRegionSnafu, InvalidPartitionExprSnafu, JoinSnafu, RegionClosedSnafu,
RegionDroppedSnafu, RegionTruncatedSnafu, Result,
Error, FlushRegionSnafu, JoinSnafu, RegionClosedSnafu, RegionDroppedSnafu,
RegionTruncatedSnafu, Result,
};
use crate::manifest::action::{RegionEdit, RegionMetaAction, RegionMetaActionList};
use crate::memtable::{
@@ -54,7 +54,7 @@ use crate::read::merge::MergeReaderBuilder;
use crate::read::{FlatSource, Source};
use crate::region::options::{IndexOptions, MergeMode, RegionOptions};
use crate::region::version::{VersionControlData, VersionControlRef, VersionRef};
use crate::region::{ManifestContextRef, RegionLeaderState, RegionRoleState};
use crate::region::{ManifestContextRef, RegionLeaderState, RegionRoleState, parse_partition_expr};
use crate::request::{
BackgroundNotify, FlushFailed, FlushFinished, OptionOutputTx, OutputTx, SenderBulkRequest,
SenderDdlRequest, SenderWriteRequest, WorkerRequest, WorkerRequestWithTime,
@@ -252,6 +252,10 @@ pub(crate) struct RegionFlushTask {
pub(crate) flush_semaphore: Arc<Semaphore>,
/// Whether the region is in staging mode.
pub(crate) is_staging: bool,
/// Partition expression of the region.
///
/// This is used to generate the file meta.
pub(crate) partition_expr: Option<String>,
}
impl RegionFlushTask {
@@ -441,14 +445,8 @@ impl RegionFlushTask {
let mut file_metas = Vec::with_capacity(memtables.len());
let mut flushed_bytes = 0;
let mut series_count = 0;
// Convert partition expression once outside the map
let partition_expr = match &version.metadata.partition_expr {
None => None,
Some(json_expr) if json_expr.is_empty() => None,
Some(json_str) => partition::expr::PartitionExpr::from_json_str(json_str)
.with_context(|_| InvalidPartitionExprSnafu { expr: json_str })?,
};
let mut flush_metrics = Metrics::new(WriteType::Flush);
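// Parse the partition expression once so it can be attached to each flushed file meta.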
let partition_expr = parse_partition_expr(self.partition_expr.as_deref())?;
for mem in memtables {
if mem.is_empty() {
// Skip empty memtables.
@@ -466,24 +464,26 @@ impl RegionFlushTask {
// Sets `for_flush` flag to true.
let mem_ranges = mem.ranges(None, RangesOptions::for_flush())?;
let num_mem_ranges = mem_ranges.ranges.len();
let num_mem_rows = mem_ranges.stats.num_rows();
// Aggregate stats from all ranges
let num_mem_rows = mem_ranges.num_rows();
let memtable_series_count = mem_ranges.series_count();
let memtable_id = mem.id();
// Increases the series count for each mem range. We assume each mem range contains different
// series, so the counter may report more series than the actual series count.
series_count += mem_ranges.stats.series_count();
series_count += memtable_series_count;
if mem_ranges.is_record_batch() {
let flush_start = Instant::now();
let FlushFlatMemResult {
num_encoded,
max_sequence,
num_sources,
results,
} = self
.flush_flat_mem_ranges(version, &write_opts, mem_ranges)
.await?;
for (source_idx, result) in results.into_iter().enumerate() {
let (ssts_written, metrics) = result?;
let (max_sequence, ssts_written, metrics) = result?;
if ssts_written.is_empty() {
// No data written.
continue;
@@ -523,7 +523,7 @@ impl RegionFlushTask {
compact_cost,
);
} else {
let max_sequence = mem_ranges.stats.max_sequence();
let max_sequence = mem_ranges.max_sequence();
let source = memtable_source(mem_ranges, &version.options).await?;
// Flush to level 0.
@@ -585,8 +585,7 @@ impl RegionFlushTask {
)?;
let mut tasks = Vec::with_capacity(flat_sources.encoded.len() + flat_sources.sources.len());
let num_encoded = flat_sources.encoded.len();
let max_sequence = flat_sources.max_sequence;
for source in flat_sources.sources {
for (source, max_sequence) in flat_sources.sources {
let source = Either::Right(source);
let write_request = self.new_write_request(version, max_sequence, source);
let access_layer = self.access_layer.clone();
@@ -598,11 +597,11 @@ impl RegionFlushTask {
let ssts = access_layer
.write_sst(write_request, &write_opts, &mut metrics)
.await?;
Ok((ssts, metrics))
Ok((max_sequence, ssts, metrics))
});
tasks.push(task);
}
for encoded in flat_sources.encoded {
for (encoded, max_sequence) in flat_sources.encoded {
let access_layer = self.access_layer.clone();
let cache_manager = self.cache_manager.clone();
let region_id = version.metadata.region_id;
@@ -612,7 +611,7 @@ impl RegionFlushTask {
let metrics = access_layer
.put_sst(&encoded.data, region_id, &encoded.sst_info, &cache_manager)
.await?;
Ok((smallvec![encoded.sst_info], metrics))
Ok((max_sequence, smallvec![encoded.sst_info], metrics))
});
tasks.push(task);
}
@@ -622,7 +621,6 @@ impl RegionFlushTask {
.context(JoinSnafu)?;
Ok(FlushFlatMemResult {
num_encoded,
max_sequence,
num_sources,
results,
})
@@ -698,9 +696,8 @@ impl RegionFlushTask {
struct FlushFlatMemResult {
num_encoded: usize,
max_sequence: u64,
num_sources: usize,
results: Vec<Result<(SstInfoArray, Metrics)>>,
results: Vec<Result<(SequenceNumber, SstInfoArray, Metrics)>>,
}
struct DoFlushMemtablesResult {
@@ -746,9 +743,8 @@ async fn memtable_source(mem_ranges: MemtableRanges, options: &RegionOptions) ->
}
struct FlatSources {
max_sequence: u64,
sources: SmallVec<[FlatSource; 4]>,
encoded: SmallVec<[EncodedRange; 4]>,
sources: SmallVec<[(FlatSource, SequenceNumber); 4]>,
encoded: SmallVec<[(EncodedRange, SequenceNumber); 4]>,
}
/// Returns the max sequence and [FlatSource] for the given memtable.
@@ -758,18 +754,17 @@ fn memtable_flat_sources(
options: &RegionOptions,
field_column_start: usize,
) -> Result<FlatSources> {
let MemtableRanges { ranges, stats } = mem_ranges;
let max_sequence = stats.max_sequence();
let MemtableRanges { ranges } = mem_ranges;
let mut flat_sources = FlatSources {
max_sequence,
sources: SmallVec::new(),
encoded: SmallVec::new(),
};
if ranges.len() == 1 {
let only_range = ranges.into_values().next().unwrap();
let max_sequence = only_range.stats().max_sequence();
if let Some(encoded) = only_range.encoded() {
flat_sources.encoded.push(encoded);
flat_sources.encoded.push((encoded, max_sequence));
} else {
let iter = only_range.build_record_batch_iter(None)?;
// Dedup according to append mode and merge mode.
@@ -780,61 +775,138 @@ fn memtable_flat_sources(
field_column_start,
iter,
);
flat_sources.sources.push(FlatSource::Iter(iter));
flat_sources
.sources
.push((FlatSource::Iter(iter), max_sequence));
};
} else {
let min_flush_rows = stats.num_rows / 8;
// Calculate total rows from all ranges for min_flush_rows calculation
let total_rows: usize = ranges.values().map(|r| r.stats().num_rows()).sum();
let min_flush_rows = total_rows / 8;
let min_flush_rows = min_flush_rows.max(DEFAULT_ROW_GROUP_SIZE);
let mut last_iter_rows = 0;
let num_ranges = ranges.len();
let mut input_iters = Vec::with_capacity(num_ranges);
let mut current_ranges = Vec::new();
for (_range_id, range) in ranges {
if let Some(encoded) = range.encoded() {
flat_sources.encoded.push(encoded);
let max_sequence = range.stats().max_sequence();
flat_sources.encoded.push((encoded, max_sequence));
continue;
}
let iter = range.build_record_batch_iter(None)?;
input_iters.push(iter);
last_iter_rows += range.num_rows();
current_ranges.push(range);
if last_iter_rows > min_flush_rows {
// Calculate max_sequence from all merged ranges
let max_sequence = current_ranges
.iter()
.map(|r| r.stats().max_sequence())
.max()
.unwrap_or(0);
let maybe_dedup = merge_and_dedup(
&schema,
options,
options.append_mode,
options.merge_mode(),
field_column_start,
std::mem::replace(&mut input_iters, Vec::with_capacity(num_ranges)),
)?;
flat_sources.sources.push(FlatSource::Iter(maybe_dedup));
flat_sources
.sources
.push((FlatSource::Iter(maybe_dedup), max_sequence));
last_iter_rows = 0;
current_ranges.clear();
}
}
// Handle remaining iters.
if !input_iters.is_empty() {
let maybe_dedup = merge_and_dedup(&schema, options, field_column_start, input_iters)?;
let max_sequence = current_ranges
.iter()
.map(|r| r.stats().max_sequence())
.max()
.unwrap_or(0);
flat_sources.sources.push(FlatSource::Iter(maybe_dedup));
let maybe_dedup = merge_and_dedup(
&schema,
options.append_mode,
options.merge_mode(),
field_column_start,
input_iters,
)?;
flat_sources
.sources
.push((FlatSource::Iter(maybe_dedup), max_sequence));
}
}
Ok(flat_sources)
}
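The multi-range branch above accumulates ranges until their rows exceed max(total_rows / 8, DEFAULT_ROW_GROUP_SIZE) and then emits one merged source. A self-contained sketch of that batching rule, with made-up row counts and an assumed value for DEFAULT_ROW_GROUP_SIZE:

fn main() {
    // Assumed constant for illustration only; the real value lives in the crate.
    const DEFAULT_ROW_GROUP_SIZE: usize = 102_400;
    let range_rows = [40_000usize, 30_000, 200_000, 10_000];

    let total_rows: usize = range_rows.iter().sum();
    let min_flush_rows = (total_rows / 8).max(DEFAULT_ROW_GROUP_SIZE);

    let mut batches = Vec::new();
    let (mut current, mut current_rows) = (Vec::new(), 0usize);
    for rows in range_rows {
        current.push(rows);
        current_rows += rows;
        if current_rows > min_flush_rows {
            // Close the batch once it passes the threshold, like the flush loop does.
            batches.push(std::mem::take(&mut current));
            current_rows = 0;
        }
    }
    if !current.is_empty() {
        batches.push(current);
    }
    // total_rows = 280_000, min_flush_rows = 102_400:
    // the first batch closes once it passes the threshold, the remainder forms a second one.
    assert_eq!(batches, vec![vec![40_000, 30_000, 200_000], vec![10_000]]);
}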
fn merge_and_dedup(
/// Merges multiple record batch iterators and applies deduplication based on the specified mode.
///
/// This function is used during the flush process to combine data from multiple memtable ranges
/// into a single stream while handling duplicate records according to the configured merge strategy.
///
/// # Arguments
///
/// * `schema` - The Arrow schema reference that defines the structure of the record batches
/// * `append_mode` - When true, no deduplication is performed and all records are preserved.
/// This is used for append-only workloads where duplicate handling is not required.
/// * `merge_mode` - The strategy used for deduplication when not in append mode:
/// - `MergeMode::LastRow`: Keeps the last record for each primary key
/// - `MergeMode::LastNonNull`: Keeps the last non-null values for each field
/// * `field_column_start` - The starting column index for fields in the record batch.
///   Used when the merge mode is `MergeMode::LastNonNull` to identify which columns
/// contain field values versus primary key columns.
/// * `input_iters` - A vector of record batch iterators to be merged and deduplicated
///
/// # Returns
///
/// Returns a boxed record batch iterator that yields the merged and potentially deduplicated
/// record batches.
///
/// # Behavior
///
/// 1. Creates a `FlatMergeIterator` to merge all input iterators in sorted order based on
/// primary key and timestamp
/// 2. If `append_mode` is true, returns the merge iterator directly without deduplication
/// 3. If `append_mode` is false, wraps the merge iterator with a `FlatDedupIterator` that
/// applies the specified merge mode:
/// - `LastRow`: Removes duplicate rows, keeping only the last one
/// - `LastNonNull`: Removes duplicates but preserves the last non-null value for each field
///
/// # Examples
///
/// ```ignore
/// let merged_iter = merge_and_dedup(
/// &schema,
/// false, // not append mode, apply dedup
/// MergeMode::LastRow,
/// 2, // fields start at column 2 after primary key columns
/// vec![iter1, iter2, iter3],
/// )?;
/// ```
pub fn merge_and_dedup(
schema: &SchemaRef,
options: &RegionOptions,
append_mode: bool,
merge_mode: MergeMode,
field_column_start: usize,
input_iters: Vec<BoxedRecordBatchIterator>,
) -> Result<BoxedRecordBatchIterator> {
let merge_iter = FlatMergeIterator::new(schema.clone(), input_iters, DEFAULT_READ_BATCH_SIZE)?;
let maybe_dedup = if options.append_mode {
let maybe_dedup = if append_mode {
// No dedup in append mode
Box::new(merge_iter) as _
} else {
// Dedup according to merge mode.
match options.merge_mode() {
match merge_mode {
MergeMode::LastRow => {
Box::new(FlatDedupIterator::new(merge_iter, FlatLastRow::new(false))) as _
}
@@ -1281,6 +1353,7 @@ mod tests {
index_options: IndexOptions::default(),
flush_semaphore: Arc::new(Semaphore::new(2)),
is_staging: false,
partition_expr: None,
};
task.push_sender(OptionOutputTx::from(output_tx));
scheduler
@@ -1324,6 +1397,7 @@ mod tests {
index_options: IndexOptions::default(),
flush_semaphore: Arc::new(Semaphore::new(2)),
is_staging: false,
partition_expr: None,
})
.collect();
// Schedule first task.
@@ -1439,7 +1513,7 @@ mod tests {
// Consume the iterator and count rows
let mut total_rows = 0usize;
for source in flat_sources.sources {
for (source, _sequence) in flat_sources.sources {
match source {
crate::read::FlatSource::Iter(iter) => {
for rb in iter {
@@ -1469,7 +1543,7 @@ mod tests {
assert_eq!(1, flat_sources.sources.len());
let mut total_rows = 0usize;
for source in flat_sources.sources {
for (source, _sequence) in flat_sources.sources {
match source {
crate::read::FlatSource::Iter(iter) => {
for rb in iter {
@@ -1516,6 +1590,7 @@ mod tests {
index_options: IndexOptions::default(),
flush_semaphore: Arc::new(Semaphore::new(2)),
is_staging: false,
partition_expr: None,
})
.collect();
// Schedule first task.


@@ -28,28 +28,62 @@ use std::time::Duration;
use common_meta::datanode::GcStat;
use common_telemetry::{debug, error, info, warn};
use common_time::Timestamp;
use itertools::Itertools;
use object_store::{Entry, Lister};
use serde::{Deserialize, Serialize};
use snafu::ResultExt as _;
use store_api::storage::{FileId, FileRefsManifest, GcReport, RegionId};
use store_api::storage::{FileId, FileRef, FileRefsManifest, GcReport, IndexVersion, RegionId};
use tokio::sync::{OwnedSemaphorePermit, TryAcquireError};
use tokio_stream::StreamExt;
use crate::access_layer::AccessLayerRef;
use crate::cache::CacheManagerRef;
use crate::cache::file_cache::FileType;
use crate::config::MitoConfig;
use crate::error::{
DurationOutOfRangeSnafu, JoinSnafu, OpenDalSnafu, Result, TooManyGcJobsSnafu, UnexpectedSnafu,
};
use crate::manifest::action::RegionManifest;
use crate::metrics::GC_DELETE_FILE_CNT;
use crate::manifest::action::{RegionManifest, RemovedFile};
use crate::metrics::{GC_DELETE_FILE_CNT, GC_ORPHANED_INDEX_FILES, GC_SKIPPED_UNPARSABLE_FILES};
use crate::region::{MitoRegionRef, RegionRoleState};
use crate::sst::file::delete_files;
use crate::sst::file::{RegionFileId, RegionIndexId, delete_files, delete_index};
use crate::sst::location::{self};
#[cfg(test)]
mod worker_test;
/// Helper function to determine if a file should be deleted based on common logic
/// shared between Parquet and Puffin file types.
fn should_delete_file(
is_in_manifest: bool,
is_in_tmp_ref: bool,
is_linger: bool,
is_eligible_for_delete: bool,
entry: &Entry,
unknown_file_may_linger_until: chrono::DateTime<chrono::Utc>,
) -> bool {
let is_known = is_linger || is_eligible_for_delete;
let is_unknown_linger_time_exceeded = || {
// if the file's expel time is unknown (because it does not appear in the delta manifest), we keep it for a while
// using its last modified time
// note that unknown files use a different lingering time
entry
.metadata()
.last_modified()
.map(|t| t < unknown_file_may_linger_until)
.unwrap_or(false)
};
!is_in_manifest
&& !is_in_tmp_ref
&& if is_known {
is_eligible_for_delete
} else {
is_unknown_linger_time_exceeded()
}
}
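A self-contained sketch of the decision table implemented by the helper above, assuming the chrono crate and substituting a plain Option<DateTime<Utc>> for the object-store Entry metadata; the function name and the sample values are illustrative only:

use chrono::{DateTime, Duration, Utc};

// Simplified stand-in: delete only files that are neither in the manifest nor in tmp refs,
// and that are either known-and-eligible, or unknown but last modified before the cutoff.
fn should_delete(
    is_in_manifest: bool,
    is_in_tmp_ref: bool,
    is_linger: bool,
    is_eligible_for_delete: bool,
    last_modified: Option<DateTime<Utc>>,
    unknown_file_may_linger_until: DateTime<Utc>,
) -> bool {
    let is_known = is_linger || is_eligible_for_delete;
    let unknown_linger_exceeded =
        || last_modified.map(|t| t < unknown_file_may_linger_until).unwrap_or(false);
    !is_in_manifest
        && !is_in_tmp_ref
        && if is_known { is_eligible_for_delete } else { unknown_linger_exceeded() }
}

fn main() {
    let cutoff = Utc::now() - Duration::hours(1);
    // Known file that passed its lingering time: delete.
    assert!(should_delete(false, false, false, true, None, cutoff));
    // Still referenced by the manifest: keep, regardless of anything else.
    assert!(!should_delete(true, false, false, true, None, cutoff));
    // Unknown file whose last-modified time is before the cutoff: delete.
    assert!(should_delete(false, false, false, false, Some(cutoff - Duration::hours(1)), cutoff));
    // Unknown file with no last-modified metadata: keep.
    assert!(!should_delete(false, false, false, false, None, cutoff));
}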
/// Limit the amount of concurrent GC jobs on the datanode
pub struct GcLimiter {
pub gc_job_limit: Arc<tokio::sync::Semaphore>,
@@ -208,7 +242,7 @@ impl LocalGcWorker {
}
/// Get tmp ref files for all current regions
pub async fn read_tmp_ref_files(&self) -> Result<HashMap<RegionId, HashSet<FileId>>> {
pub async fn read_tmp_ref_files(&self) -> Result<HashMap<RegionId, HashSet<FileRef>>> {
let mut tmp_ref_files = HashMap::new();
for (region_id, file_refs) in &self.file_ref_manifest.file_refs {
tmp_ref_files
@@ -230,6 +264,7 @@ impl LocalGcWorker {
let now = std::time::Instant::now();
let mut deleted_files = HashMap::new();
let mut deleted_indexes = HashMap::new();
let tmp_ref_files = self.read_tmp_ref_files().await?;
for (region_id, region) in &self.regions {
let per_region_time = std::time::Instant::now();
@@ -247,7 +282,12 @@ impl LocalGcWorker {
.cloned()
.unwrap_or_else(HashSet::new);
let files = self.do_region_gc(region.clone(), &tmp_ref_files).await?;
deleted_files.insert(*region_id, files);
let index_files = files
.iter()
.filter_map(|f| f.index_version().map(|v| (f.file_id(), v)))
.collect_vec();
deleted_files.insert(*region_id, files.into_iter().map(|f| f.file_id()).collect());
deleted_indexes.insert(*region_id, index_files);
debug!(
"GC for region {} took {} secs.",
region_id,
@@ -260,6 +300,7 @@ impl LocalGcWorker {
);
let report = GcReport {
deleted_files,
deleted_indexes,
need_retry_regions: HashSet::new(),
};
Ok(report)
@@ -282,8 +323,8 @@ impl LocalGcWorker {
pub async fn do_region_gc(
&self,
region: MitoRegionRef,
tmp_ref_files: &HashSet<FileId>,
) -> Result<Vec<FileId>> {
tmp_ref_files: &HashSet<FileRef>,
) -> Result<Vec<RemovedFile>> {
let region_id = region.region_id();
debug!("Doing gc for region {}", region_id);
@@ -311,64 +352,83 @@ impl LocalGcWorker {
.map(|s| s.len())
.sum::<usize>();
let in_used: HashSet<FileId> = current_files
.keys()
.cloned()
.chain(tmp_ref_files.clone().into_iter())
.collect();
let in_manifest = current_files
.iter()
.map(|(file_id, meta)| (*file_id, meta.index_version()))
.collect::<HashMap<_, _>>();
let unused_files = self
.list_to_be_deleted_files(region_id, &in_used, recently_removed_files, all_entries)
let in_tmp_ref = tmp_ref_files
.iter()
.map(|file_ref| (file_ref.file_id, file_ref.index_version))
.collect::<HashSet<_>>();
let deletable_files = self
.list_to_be_deleted_files(
region_id,
&in_manifest,
&in_tmp_ref,
recently_removed_files,
all_entries,
)
.await?;
let unused_file_cnt = unused_files.len();
let unused_file_cnt = deletable_files.len();
debug!(
"gc: for region {region_id}: In manifest files: {}, Tmp ref file cnt: {}, In-used files: {}, recently removed files: {}, Unused files to delete: {} ",
"gc: for region {region_id}: In manifest files: {}, Tmp ref file cnt: {}, recently removed files: {}, Unused files to delete: {} ",
current_files.len(),
tmp_ref_files.len(),
in_used.len(),
removed_file_cnt,
unused_files.len()
deletable_files.len()
);
// TODO(discord9): for now, ignore async index files as their design is not stable; needs to be improved once
// the index file design is stable
let file_pairs: Vec<(FileId, u64)> =
unused_files.iter().map(|file_id| (*file_id, 0)).collect();
// TODO(discord9): the gc worker needs another major refactor to support versioned index files
debug!(
"Found {} unused index files to delete for region {}",
file_pairs.len(),
deletable_files.len(),
region_id
);
self.delete_files(region_id, &file_pairs).await?;
self.delete_files(region_id, &deletable_files).await?;
debug!(
"Successfully deleted {} unused files for region {}",
unused_file_cnt, region_id
);
// TODO(discord9): update region manifest about deleted files
self.update_manifest_removed_files(&region, unused_files.clone())
self.update_manifest_removed_files(&region, deletable_files.clone())
.await?;
Ok(unused_files)
Ok(deletable_files)
}
async fn delete_files(&self, region_id: RegionId, file_ids: &[(FileId, u64)]) -> Result<()> {
async fn delete_files(&self, region_id: RegionId, removed_files: &[RemovedFile]) -> Result<()> {
let mut index_ids = vec![];
let file_pairs = removed_files
.iter()
.filter_map(|f| match f {
RemovedFile::File(file_id, v) => Some((*file_id, v.unwrap_or(0))),
RemovedFile::Index(file_id, index_version) => {
let region_index_id =
RegionIndexId::new(RegionFileId::new(region_id, *file_id), *index_version);
index_ids.push(region_index_id);
None
}
})
.collect_vec();
delete_files(
region_id,
file_ids,
&file_pairs,
true,
&self.access_layer,
&self.cache_manager,
)
.await?;
for index_id in index_ids {
delete_index(index_id, &self.access_layer, &self.cache_manager).await?;
}
// FIXME(discord9): if files are already deleted before calling delete_files, the metric will be inaccurate; there is no clean way to fix it now
GC_DELETE_FILE_CNT.add(file_ids.len() as i64);
GC_DELETE_FILE_CNT.add(removed_files.len() as i64);
Ok(())
}
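delete_files splits the removed entries into data-file pairs and index ids in a single pass. A stand-alone sketch of that filter_map split on a stand-in enum (not the crate's RemovedFile type; ids here are plain integers):

enum Removed {
    File(u64, Option<u64>),
    Index(u64, u64),
}

fn main() {
    let removed = vec![Removed::File(1, Some(3)), Removed::Index(2, 1), Removed::File(4, None)];

    // Collect data-file pairs, side-collecting index ids in the same pass.
    let mut index_ids = Vec::new();
    let file_pairs: Vec<(u64, u64)> = removed
        .iter()
        .filter_map(|r| match r {
            Removed::File(id, v) => Some((*id, v.unwrap_or(0))),
            Removed::Index(id, version) => {
                index_ids.push((*id, *version));
                None
            }
        })
        .collect();

    assert_eq!(file_pairs, vec![(1, 3), (4, 0)]);
    assert_eq!(index_ids, vec![(2, 1)]);
}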
@@ -377,7 +437,7 @@ impl LocalGcWorker {
async fn update_manifest_removed_files(
&self,
region: &MitoRegionRef,
deleted_files: Vec<FileId>,
deleted_files: Vec<RemovedFile>,
) -> Result<()> {
let deleted_file_cnt = deleted_files.len();
debug!(
@@ -403,12 +463,12 @@ impl LocalGcWorker {
pub async fn get_removed_files_expel_times(
&self,
region_manifest: &Arc<RegionManifest>,
) -> Result<BTreeMap<Timestamp, HashSet<FileId>>> {
) -> Result<BTreeMap<Timestamp, HashSet<RemovedFile>>> {
let mut ret = BTreeMap::new();
for files in &region_manifest.removed_files.removed_files {
let expel_time = Timestamp::new_millisecond(files.removed_at);
let set = ret.entry(expel_time).or_insert_with(HashSet::new);
set.extend(files.file_ids.iter().cloned());
set.extend(files.files.iter().cloned());
}
Ok(ret)
@@ -535,75 +595,136 @@ impl LocalGcWorker {
Ok(all_entries)
}
/// Filter files to determine which ones can be deleted based on usage status and lingering time.
/// Returns a vector of file IDs that are safe to delete.
fn filter_deletable_files(
&self,
entries: Vec<Entry>,
in_use_filenames: &HashSet<FileId>,
may_linger_filenames: &HashSet<&FileId>,
eligible_for_removal: &HashSet<&FileId>,
in_manifest: &HashMap<FileId, Option<IndexVersion>>,
in_tmp_ref: &HashSet<(FileId, Option<IndexVersion>)>,
may_linger_files: &HashSet<&RemovedFile>,
eligible_for_delete: &HashSet<&RemovedFile>,
unknown_file_may_linger_until: chrono::DateTime<chrono::Utc>,
) -> (Vec<FileId>, HashSet<FileId>) {
let mut all_unused_files_ready_for_delete = vec![];
let mut all_in_exist_linger_files = HashSet::new();
) -> Vec<RemovedFile> {
let mut ready_for_delete = vec![];
// group everything by file id for easier checking
let in_tmp_ref: HashMap<FileId, HashSet<IndexVersion>> =
in_tmp_ref
.iter()
.fold(HashMap::new(), |mut acc, (file, version)| {
let indices = acc.entry(*file).or_default();
if let Some(version) = version {
indices.insert(*version);
}
acc
});
let may_linger_files: HashMap<FileId, HashSet<&RemovedFile>> = may_linger_files
.iter()
.fold(HashMap::new(), |mut acc, file| {
let indices = acc.entry(file.file_id()).or_default();
indices.insert(file);
acc
});
let eligible_for_delete: HashMap<FileId, HashSet<&RemovedFile>> = eligible_for_delete
.iter()
.fold(HashMap::new(), |mut acc, file| {
let indices = acc.entry(file.file_id()).or_default();
indices.insert(file);
acc
});
for entry in entries {
let file_id = match location::parse_file_id_from_path(entry.name()) {
Ok(file_id) => file_id,
let (file_id, file_type) = match location::parse_file_id_type_from_path(entry.name()) {
Ok((file_id, file_type)) => (file_id, file_type),
Err(err) => {
error!(err; "Failed to parse file id from path: {}", entry.name());
// if we can't parse the file id, it means it's not an sst or index file;
// don't delete it because we don't know what it is
GC_SKIPPED_UNPARSABLE_FILES.inc();
continue;
}
};
if may_linger_filenames.contains(&file_id) {
all_in_exist_linger_files.insert(file_id);
}
let should_delete = match file_type {
FileType::Parquet => {
let is_in_manifest = in_manifest.contains_key(&file_id);
let is_in_tmp_ref = in_tmp_ref.contains_key(&file_id);
let is_linger = may_linger_files.contains_key(&file_id);
let is_eligible_for_delete = eligible_for_delete.contains_key(&file_id);
let should_delete = !in_use_filenames.contains(&file_id)
&& !may_linger_filenames.contains(&file_id)
&& {
if !eligible_for_removal.contains(&file_id) {
// if the file's expel time is unknown (because it does not appear in the delta manifest), we keep it for a while
// using its last modified time
// note that unknown files use a different lingering time
entry
.metadata()
.last_modified()
.map(|t| t < unknown_file_may_linger_until)
.unwrap_or(false)
} else {
// if the file did appear in manifest delta(and passes previous predicate), we can delete it immediately
true
}
};
should_delete_file(
is_in_manifest,
is_in_tmp_ref,
is_linger,
is_eligible_for_delete,
&entry,
unknown_file_may_linger_until,
)
}
FileType::Puffin(version) => {
// note: we need to check both the file id and the version
let is_in_manifest = in_manifest
.get(&file_id)
.map(|opt_ver| *opt_ver == Some(version))
.unwrap_or(false);
let is_in_tmp_ref = in_tmp_ref
.get(&file_id)
.map(|versions| versions.contains(&version))
.unwrap_or(false);
let is_linger = may_linger_files
.get(&file_id)
.map(|files| files.contains(&&RemovedFile::Index(file_id, version)))
.unwrap_or(false);
let is_eligible_for_delete = eligible_for_delete
.get(&file_id)
.map(|files| files.contains(&&RemovedFile::Index(file_id, version)))
.unwrap_or(false);
should_delete_file(
is_in_manifest,
is_in_tmp_ref,
is_linger,
is_eligible_for_delete,
&entry,
unknown_file_may_linger_until,
)
}
};
if should_delete {
all_unused_files_ready_for_delete.push(file_id);
let removed_file = match file_type {
FileType::Parquet => {
// note: this is because we don't track the index version for parquet files;
// since entries come from listing, we can't get the index version from the path
RemovedFile::File(file_id, None)
}
FileType::Puffin(version) => {
GC_ORPHANED_INDEX_FILES.inc();
RemovedFile::Index(file_id, version)
}
};
ready_for_delete.push(removed_file);
}
}
(all_unused_files_ready_for_delete, all_in_exist_linger_files)
ready_for_delete
}
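The version lookups in filter_deletable_files first group flat (file id, version) pairs by file id. A small sketch of that fold-into-map grouping, using plain strings and integers instead of the crate's FileId/IndexVersion types:

use std::collections::{HashMap, HashSet};

fn main() {
    // (file id, optional index version) pairs, as they might arrive from tmp refs.
    let refs = vec![("a", Some(1u64)), ("a", Some(2)), ("b", None)];

    // Group by file id; files without a version still get an (empty) entry.
    let by_file: HashMap<&str, HashSet<u64>> =
        refs.into_iter().fold(HashMap::new(), |mut acc, (file, version)| {
            let versions = acc.entry(file).or_default();
            if let Some(v) = version {
                versions.insert(v);
            }
            acc
        });

    assert_eq!(by_file["a"], HashSet::from([1, 2]));
    assert!(by_file["b"].is_empty());
}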
/// Concurrently list unused files in the region dir
/// because there may be a lot of files in the region dir
/// and listing them may take a long time.
/// List files to be deleted based on their presence in the manifest, temporary references, and recently removed files.
/// Returns a vector of `RemovedFile` that are eligible for deletion.
///
/// When `full_file_listing` is false, this method only deletes (a subset of) the files tracked in
/// `recently_removed_files`, which significantly
/// improves performance. When `full_file_listing` is true, it reads from `all_entries` to find
/// and delete orphan files (files not tracked in the manifest).
///
/// When `full_file_listing` is false, this method will only delete files tracked in
/// `recently_removed_files` without performing expensive list operations, which significantly
/// improves performance. When `full_file_listing` is true, it performs a full listing to
/// find and delete orphan files.
pub async fn list_to_be_deleted_files(
&self,
region_id: RegionId,
in_used: &HashSet<FileId>,
recently_removed_files: BTreeMap<Timestamp, HashSet<FileId>>,
in_manifest: &HashMap<FileId, Option<IndexVersion>>,
in_tmp_ref: &HashSet<(FileId, Option<IndexVersion>)>,
recently_removed_files: BTreeMap<Timestamp, HashSet<RemovedFile>>,
all_entries: Vec<Entry>,
) -> Result<Vec<FileId>> {
) -> Result<Vec<RemovedFile>> {
let now = chrono::Utc::now();
let may_linger_until = self
.opt
@@ -634,8 +755,10 @@ impl LocalGcWorker {
};
debug!("may_linger_files: {:?}", may_linger_files);
let may_linger_filenames = may_linger_files.values().flatten().collect::<HashSet<_>>();
let all_may_linger_files = may_linger_files.values().flatten().collect::<HashSet<_>>();
// known files (tracked in the removed files field) that are eligible for removal
// (i.e. they have passed the lingering time)
let eligible_for_removal = recently_removed_files
.values()
.flatten()
@@ -646,12 +769,24 @@ impl LocalGcWorker {
if !self.full_file_listing {
// Only delete files that:
// 1. Are in recently_removed_files (tracked in manifest)
// 2. Are not in use
// 2. Are not in use (in the manifest or tmp refs)
// 3. Have passed the lingering time
let files_to_delete: Vec<FileId> = eligible_for_removal
let files_to_delete: Vec<RemovedFile> = eligible_for_removal
.iter()
.filter(|file_id| !in_used.contains(*file_id))
.map(|&f| *f)
.filter(|file_id| {
let in_use = match file_id {
RemovedFile::File(file_id, index_version) => {
in_manifest.get(file_id) == Some(index_version)
|| in_tmp_ref.contains(&(*file_id, *index_version))
}
RemovedFile::Index(file_id, index_version) => {
in_manifest.get(file_id) == Some(&Some(*index_version))
|| in_tmp_ref.contains(&(*file_id, Some(*index_version)))
}
};
!in_use
})
.map(|&f| f.clone())
.collect();
info!(
@@ -666,16 +801,14 @@ impl LocalGcWorker {
// Full file listing mode: get the full list of files from object store
// Step 3: Filter files to determine which ones can be deleted
let (all_unused_files_ready_for_delete, all_in_exist_linger_files) = self
.filter_deletable_files(
all_entries,
in_used,
&may_linger_filenames,
&eligible_for_removal,
unknown_file_may_linger_until,
);
debug!("All in exist linger files: {:?}", all_in_exist_linger_files);
let all_unused_files_ready_for_delete = self.filter_deletable_files(
all_entries,
in_manifest,
in_tmp_ref,
&all_may_linger_files,
&eligible_for_removal,
unknown_file_may_linger_until,
);
Ok(all_unused_files_ready_for_delete)
}


@@ -19,12 +19,13 @@ use api::v1::Rows;
use common_telemetry::init_default_ut_logging;
use store_api::region_engine::RegionEngine as _;
use store_api::region_request::{RegionCompactRequest, RegionRequest};
use store_api::storage::{FileRefsManifest, RegionId};
use store_api::storage::{FileRef, FileRefsManifest, RegionId};
use crate::config::MitoConfig;
use crate::engine::MitoEngine;
use crate::engine::compaction_test::{delete_and_flush, put_and_flush};
use crate::gc::{GcConfig, LocalGcWorker};
use crate::manifest::action::RemovedFile;
use crate::region::MitoRegionRef;
use crate::test_util::{
CreateRequestBuilder, TestEnv, build_rows, flush_region, put_rows, rows_schema,
@@ -120,9 +121,9 @@ async fn test_gc_worker_basic_truncate() {
let manifest = region.manifest_ctx.manifest().await;
assert!(
manifest.removed_files.removed_files[0]
.file_ids
.contains(&to_be_deleted_file_id)
&& manifest.removed_files.removed_files[0].file_ids.len() == 1
.files
.contains(&RemovedFile::File(to_be_deleted_file_id, None))
&& manifest.removed_files.removed_files[0].files.len() == 1
&& manifest.files.is_empty(),
"Manifest after truncate: {:?}",
manifest
@@ -214,9 +215,9 @@ async fn test_gc_worker_truncate_with_ref() {
let manifest = region.manifest_ctx.manifest().await;
assert!(
manifest.removed_files.removed_files[0]
.file_ids
.contains(&to_be_deleted_file_id)
&& manifest.removed_files.removed_files[0].file_ids.len() == 1
.files
.contains(&RemovedFile::File(to_be_deleted_file_id, None))
&& manifest.removed_files.removed_files[0].files.len() == 1
&& manifest.files.is_empty(),
"Manifest after truncate: {:?}",
manifest
@@ -225,7 +226,11 @@ async fn test_gc_worker_truncate_with_ref() {
let regions = BTreeMap::from([(region_id, region.clone())]);
let file_ref_manifest = FileRefsManifest {
file_refs: [(region_id, HashSet::from([to_be_deleted_file_id]))].into(),
file_refs: [(
region_id,
HashSet::from([FileRef::new(region_id, to_be_deleted_file_id, None)]),
)]
.into(),
manifest_version: [(region_id, version)].into(),
};
let gc_worker = create_gc_worker(&engine, regions, &file_ref_manifest, true).await;
@@ -235,7 +240,7 @@ async fn test_gc_worker_truncate_with_ref() {
let manifest = region.manifest_ctx.manifest().await;
assert!(
manifest.removed_files.removed_files[0].file_ids.len() == 1 && manifest.files.is_empty(),
manifest.removed_files.removed_files[0].files.len() == 1 && manifest.files.is_empty(),
"Manifest: {:?}",
manifest
);
@@ -300,7 +305,7 @@ async fn test_gc_worker_basic_compact() {
let region = engine.get_region(region_id).unwrap();
let manifest = region.manifest_ctx.manifest().await;
assert_eq!(manifest.removed_files.removed_files[0].file_ids.len(), 3);
assert_eq!(manifest.removed_files.removed_files[0].files.len(), 3);
let version = manifest.manifest_version;
@@ -376,7 +381,7 @@ async fn test_gc_worker_compact_with_ref() {
let region = engine.get_region(region_id).unwrap();
let manifest = region.manifest_ctx.manifest().await;
assert_eq!(manifest.removed_files.removed_files[0].file_ids.len(), 3);
assert_eq!(manifest.removed_files.removed_files[0].files.len(), 3);
let version = manifest.manifest_version;
@@ -385,9 +390,12 @@ async fn test_gc_worker_compact_with_ref() {
file_refs: HashMap::from([(
region_id,
manifest.removed_files.removed_files[0]
.file_ids
.files
.iter()
.cloned()
.map(|removed_file| match removed_file {
RemovedFile::File(file_id, v) => FileRef::new(region_id, *file_id, *v),
RemovedFile::Index(file_id, v) => FileRef::new(region_id, *file_id, Some(*v)),
})
.collect(),
)]),
manifest_version: [(region_id, version)].into(),


@@ -22,7 +22,7 @@ use serde::{Deserialize, Serialize};
use snafu::{OptionExt, ResultExt};
use store_api::ManifestVersion;
use store_api::metadata::RegionMetadataRef;
use store_api::storage::{FileId, RegionId, SequenceNumber};
use store_api::storage::{FileId, IndexVersion, RegionId, SequenceNumber};
use strum::Display;
use crate::error::{RegionMetadataNotFoundSnafu, Result, SerdeJsonSnafu, Utf8Snafu};
@@ -45,6 +45,18 @@ pub enum RegionMetaAction {
Truncate(RegionTruncate),
}
impl RegionMetaAction {
/// Returns true if the action is a change action.
pub fn is_change(&self) -> bool {
matches!(self, RegionMetaAction::Change(_))
}
/// Returns true if the action is an edit action.
pub fn is_edit(&self) -> bool {
matches!(self, RegionMetaAction::Edit(_))
}
}
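is_change / is_edit are thin matches!-based predicates. A tiny stand-alone illustration of the same idiom on a hypothetical enum (constructing real RegionMetaAction values would require full region metadata, so the types below are made up):

enum Action {
    Change(String),
    Edit(String),
    Remove(String),
}

impl Action {
    // Same pattern as RegionMetaAction::is_edit: a matches! predicate on the variant.
    fn is_edit(&self) -> bool {
        matches!(self, Action::Edit(_))
    }
}

fn main() {
    let actions = vec![
        Action::Change("meta".into()),
        Action::Edit("files".into()),
        Action::Remove("old".into()),
    ];
    let edits: Vec<_> = actions.iter().filter(|a| a.is_edit()).collect();
    assert_eq!(edits.len(), 1);
}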
#[derive(Serialize, Deserialize, Clone, Debug, PartialEq, Eq)]
pub struct RegionChange {
/// The metadata after changed.
@@ -193,17 +205,27 @@ impl RegionManifestBuilder {
pub fn apply_edit(&mut self, manifest_version: ManifestVersion, edit: RegionEdit) {
self.manifest_version = manifest_version;
let mut removed_files = vec![];
for file in edit.files_to_add {
self.files.insert(file.file_id, file);
if let Some(old_file) = self.files.insert(file.file_id, file.clone())
&& let Some(old_index) = old_file.index_version()
&& !old_file.is_index_up_to_date(&file)
{
// The old file has an index that is now outdated.
removed_files.push(RemovedFile::Index(old_file.file_id, old_index));
}
}
self.removed_files.add_removed_files(
removed_files.extend(
edit.files_to_remove
.iter()
.map(|meta| meta.file_id)
.collect(),
edit.timestamp_ms
.unwrap_or_else(|| Utc::now().timestamp_millis()),
.map(|f| RemovedFile::File(f.file_id, f.index_version())),
);
let at = edit
.timestamp_ms
.unwrap_or_else(|| Utc::now().timestamp_millis());
self.removed_files.add_removed_files(removed_files, at);
for file in edit.files_to_remove {
self.files.remove(&file.file_id);
}
@@ -236,7 +258,10 @@ impl RegionManifestBuilder {
self.flushed_sequence = truncated_sequence;
self.truncated_entry_id = Some(truncated_entry_id);
self.removed_files.add_removed_files(
self.files.values().map(|meta| meta.file_id).collect(),
self.files
.values()
.map(|f| RemovedFile::File(f.file_id, f.index_version()))
.collect(),
truncate
.timestamp_ms
.unwrap_or_else(|| Utc::now().timestamp_millis()),
@@ -245,7 +270,10 @@ impl RegionManifestBuilder {
}
TruncateKind::Partial { files_to_remove } => {
self.removed_files.add_removed_files(
files_to_remove.iter().map(|meta| meta.file_id).collect(),
files_to_remove
.iter()
.map(|f| RemovedFile::File(f.file_id, f.index_version()))
.collect(),
truncate
.timestamp_ms
.unwrap_or_else(|| Utc::now().timestamp_millis()),
@@ -295,20 +323,22 @@ pub struct RemovedFilesRecord {
impl RemovedFilesRecord {
/// Clear the actually deleted files from the list of removed files
pub fn clear_deleted_files(&mut self, deleted_files: Vec<FileId>) {
pub fn clear_deleted_files(&mut self, deleted_files: Vec<RemovedFile>) {
let deleted_file_set: HashSet<_> = HashSet::from_iter(deleted_files);
for files in self.removed_files.iter_mut() {
files.file_ids.retain(|fid| !deleted_file_set.contains(fid));
files
.files
.retain(|removed| !deleted_file_set.contains(removed));
}
self.removed_files.retain(|fs| !fs.file_ids.is_empty());
self.removed_files.retain(|fs| !fs.files.is_empty());
}
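clear_deleted_files builds a set of deleted entries once, drops them from every record, and prunes empty records. A minimal sketch of that retain-with-set pattern on plain integers (ids here stand in for RemovedFile values):

use std::collections::HashSet;

fn main() {
    // Build the lookup set once, so each retain check is O(1).
    let deleted: HashSet<u32> = HashSet::from([2, 4]);
    let mut records: Vec<Vec<u32>> = vec![vec![1, 2], vec![4], vec![3]];

    for record in records.iter_mut() {
        record.retain(|id| !deleted.contains(id));
    }
    // Drop records that became empty, mirroring the final retain above.
    records.retain(|r| !r.is_empty());

    assert_eq!(records, vec![vec![1], vec![3]]);
}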
pub fn update_file_removed_cnt_to_stats(&self, stats: &ManifestStats) {
let cnt = self
.removed_files
.iter()
.map(|r| r.file_ids.len() as u64)
.map(|r| r.files.len() as u64)
.sum();
stats
.file_removed_cnt
@@ -322,18 +352,77 @@ pub struct RemovedFiles {
/// the files are removed from manifest. The timestamp is in milliseconds since unix epoch.
pub removed_at: i64,
/// The set of file ids that are removed.
pub file_ids: HashSet<FileId>,
#[serde(default)]
pub files: HashSet<RemovedFile>,
}
/// A removed file, which can be a data file (optionally paired with an index file) or an outdated index file.
#[derive(Serialize, Hash, Clone, Debug, PartialEq, Eq)]
pub enum RemovedFile {
File(FileId, Option<IndexVersion>),
Index(FileId, IndexVersion),
}
/// Supports deserializing from the old format (just a FileId as a string) into the current
/// format (the RemovedFile enum) for backward compatibility.
/// This is needed in case there are old manifests with removed files recorded.
impl<'de> Deserialize<'de> for RemovedFile {
fn deserialize<D>(deserializer: D) -> std::result::Result<Self, D::Error>
where
D: serde::Deserializer<'de>,
{
#[derive(Deserialize)]
#[serde(untagged)]
enum CompatRemovedFile {
Enum(RemovedFileEnum),
FileId(FileId),
}
#[derive(Deserialize)]
enum RemovedFileEnum {
File(FileId, Option<IndexVersion>),
Index(FileId, IndexVersion),
}
let compat = CompatRemovedFile::deserialize(deserializer)?;
match compat {
CompatRemovedFile::FileId(file_id) => Ok(RemovedFile::File(file_id, None)),
CompatRemovedFile::Enum(e) => match e {
RemovedFileEnum::File(file_id, version) => Ok(RemovedFile::File(file_id, version)),
RemovedFileEnum::Index(file_id, version) => {
Ok(RemovedFile::Index(file_id, version))
}
},
}
}
}
impl RemovedFile {
pub fn file_id(&self) -> FileId {
match self {
RemovedFile::File(file_id, _) => *file_id,
RemovedFile::Index(file_id, _) => *file_id,
}
}
pub fn index_version(&self) -> Option<IndexVersion> {
match self {
RemovedFile::File(_, index_version) => *index_version,
RemovedFile::Index(_, index_version) => Some(*index_version),
}
}
}
impl RemovedFilesRecord {
/// Add a record of removed files with the current timestamp.
pub fn add_removed_files(&mut self, file_ids: HashSet<FileId>, at: i64) {
if file_ids.is_empty() {
pub fn add_removed_files(&mut self, removed: Vec<RemovedFile>, at: i64) {
if removed.is_empty() {
return;
}
let files = removed.into_iter().collect();
self.removed_files.push(RemovedFiles {
removed_at: at,
file_ids,
files,
});
}
@@ -396,7 +485,8 @@ impl RegionMetaActionList {
Self { actions }
}
pub fn into_region_edit(self) -> RegionEdit {
/// Split the actions into a region change and an edit.
pub fn split_region_change_and_edit(self) -> (Option<RegionChange>, RegionEdit) {
let mut edit = RegionEdit {
files_to_add: Vec::new(),
files_to_remove: Vec::new(),
@@ -406,31 +496,39 @@ impl RegionMetaActionList {
flushed_sequence: None,
committed_sequence: None,
};
let mut region_change = None;
for action in self.actions {
if let RegionMetaAction::Edit(region_edit) = action {
// Merge file adds/removes
edit.files_to_add.extend(region_edit.files_to_add);
edit.files_to_remove.extend(region_edit.files_to_remove);
// Max of flushed entry id / sequence
if let Some(eid) = region_edit.flushed_entry_id {
edit.flushed_entry_id = Some(edit.flushed_entry_id.map_or(eid, |v| v.max(eid)));
match action {
RegionMetaAction::Change(change) => {
region_change = Some(change);
}
if let Some(seq) = region_edit.flushed_sequence {
edit.flushed_sequence = Some(edit.flushed_sequence.map_or(seq, |v| v.max(seq)));
}
if let Some(seq) = region_edit.committed_sequence {
edit.committed_sequence =
Some(edit.committed_sequence.map_or(seq, |v| v.max(seq)));
}
// Prefer the latest non-none time window
if region_edit.compaction_time_window.is_some() {
edit.compaction_time_window = region_edit.compaction_time_window;
RegionMetaAction::Edit(region_edit) => {
// Merge file adds/removes
edit.files_to_add.extend(region_edit.files_to_add);
edit.files_to_remove.extend(region_edit.files_to_remove);
// Max of flushed entry id / sequence
if let Some(eid) = region_edit.flushed_entry_id {
edit.flushed_entry_id =
Some(edit.flushed_entry_id.map_or(eid, |v| v.max(eid)));
}
if let Some(seq) = region_edit.flushed_sequence {
edit.flushed_sequence =
Some(edit.flushed_sequence.map_or(seq, |v| v.max(seq)));
}
if let Some(seq) = region_edit.committed_sequence {
edit.committed_sequence =
Some(edit.committed_sequence.map_or(seq, |v| v.max(seq)));
}
// Prefer the latest non-none time window
if region_edit.compaction_time_window.is_some() {
edit.compaction_time_window = region_edit.compaction_time_window;
}
}
_ => {}
}
}
edit
(region_change, edit)
}
}
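split_region_change_and_edit merges optional flushed entry ids and sequences by taking the maximum of whatever is present. A small sketch of that Option-max merge on plain u64 sequence numbers (names are illustrative, not the crate's API):

// Merge two optional sequence numbers, keeping the larger one when both exist.
fn merge_max(acc: Option<u64>, next: Option<u64>) -> Option<u64> {
    match next {
        Some(v) => Some(acc.map_or(v, |cur| cur.max(v))),
        None => acc,
    }
}

fn main() {
    let edits = [None, Some(7), Some(3), None, Some(9)];
    let merged = edits.into_iter().fold(None, merge_max);
    assert_eq!(merged, Some(9));
}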
@@ -738,10 +836,10 @@ mod tests {
removed_files: RemovedFilesRecord {
removed_files: vec![RemovedFiles {
removed_at: 0,
file_ids: HashSet::from([FileId::parse_str(
"4b220a70-2b03-4641-9687-b65d94641208",
)
.unwrap()]),
files: HashSet::from([RemovedFile::File(
FileId::parse_str("4b220a70-2b03-4641-9687-b65d94641208").unwrap(),
None,
)]),
}],
},
sst_format: FormatType::PrimaryKey,
@@ -966,4 +1064,115 @@ mod tests {
let deserialized: RegionChange = serde_json::from_str(&serialized).unwrap();
assert_eq!(deserialized.sst_format, FormatType::Flat);
}
#[test]
fn test_removed_file_compatibility() {
let file_id = FileId::random();
// Case 1: Deserialize from FileId string (Legacy format)
let json_str = format!("\"{}\"", file_id);
let removed_file: RemovedFile = serde_json::from_str(&json_str).unwrap();
assert_eq!(removed_file, RemovedFile::File(file_id, None));
// Case 2: Deserialize from new format (File)
let removed_file_v2 = RemovedFile::File(file_id, Some(10));
let json_v2 = serde_json::to_string(&removed_file_v2).unwrap();
let deserialized_v2: RemovedFile = serde_json::from_str(&json_v2).unwrap();
assert_eq!(removed_file_v2, deserialized_v2);
// Case 3: Deserialize from new format (Index)
let removed_index = RemovedFile::Index(file_id, 20);
let json_index = serde_json::to_string(&removed_index).unwrap();
let deserialized_index: RemovedFile = serde_json::from_str(&json_index).unwrap();
assert_eq!(removed_index, deserialized_index);
// Case 4: Round-trip serialization/deserialization of new enum format with None as index version
let removed_file = RemovedFile::File(file_id, None);
let json = serde_json::to_string(&removed_file).unwrap();
let deserialized: RemovedFile = serde_json::from_str(&json).unwrap();
assert_eq!(removed_file, deserialized);
// Case 5: Deserialize mixed set in RemovedFilesRecord
// This simulates a Set<RemovedFile> which might contain old strings or new objects if manually constructed or from old versions.
// Actually, if it was HashSet<FileId>, the JSON is ["id1", "id2"].
// If it is HashSet<RemovedFile>, the JSON is [{"File":...}, "id2"] if mixed (which usually shouldn't happen, but is good to test).
let json_set = format!("[\"{}\"]", file_id);
let removed_files_set: HashSet<RemovedFile> = serde_json::from_str(&json_set).unwrap();
assert!(removed_files_set.contains(&RemovedFile::File(file_id, None)));
}
/// It is intentionally acceptable to ignore the legacy `file_ids` field when
/// deserializing [`RemovedFiles`].
///
/// In older manifests, `file_ids` recorded the set of SSTable files that were
/// candidates for garbage collection at a given `removed_at` timestamp. The
/// newer format stores this information in the `files` field instead. When we
/// deserialize an old manifest entry into the new struct, we *drop* the
/// `file_ids` field instead of trying to recover or merge it.
///
/// Dropping `file_ids` does **not** risk deleting live data: a file is only
/// physically removed when it is both (a) no longer referenced by any region
/// metadata and (b) selected by the GC worker as safe to delete. Losing the
/// historical list of candidate `file_ids` merely means some obsolete files
/// may stay on disk longer than strictly necessary.
///
/// The GC worker periodically scans storage (e.g. by walking the data
/// directories and/or consulting the latest manifest) to discover files that
/// are no longer referenced anywhere. Any files that were only referenced via
/// the dropped `file_ids` field will be rediscovered during these scans and
/// eventually deleted. Thus the system converges to a correct, fully-collected
/// state without relying on `file_ids`, and the only potential impact of
/// ignoring it is temporary disk space overhead, not data loss.
#[test]
fn test_removed_files_backward_compatibility() {
// Define the old version struct with file_ids field
#[derive(Serialize, Deserialize, Clone, Debug, PartialEq, Eq)]
struct OldRemovedFiles {
pub removed_at: i64,
pub file_ids: HashSet<FileId>,
}
// Create an old version instance
let mut file_ids = HashSet::new();
file_ids.insert(FileId::random());
file_ids.insert(FileId::random());
let old_removed_files = OldRemovedFiles {
removed_at: 1234567890,
file_ids,
};
// Serialize the old version
let old_json = serde_json::to_string(&old_removed_files).unwrap();
// Try to deserialize into new version - file_ids should be ignored
let result: Result<RemovedFiles, _> = serde_json::from_str(&old_json);
// This should succeed and create a default RemovedFiles (empty files set)
assert!(result.is_ok(), "{:?}", result);
let removed_files = result.unwrap();
assert_eq!(removed_files.removed_at, 1234567890);
assert!(removed_files.files.is_empty());
// Test that new format still works
let file_id = FileId::random();
let new_json = format!(
r#"{{
"removed_at": 1234567890,
"files": ["{}"]
}}"#,
file_id
);
let result: Result<RemovedFiles, _> = serde_json::from_str(&new_json);
assert!(result.is_ok());
let removed_files = result.unwrap();
assert_eq!(removed_files.removed_at, 1234567890);
assert_eq!(removed_files.files.len(), 1);
assert!(
removed_files
.files
.contains(&RemovedFile::File(file_id, None))
);
}
}


@@ -21,7 +21,6 @@ use futures::TryStreamExt;
use object_store::ObjectStore;
use snafu::{OptionExt, ResultExt, ensure};
use store_api::metadata::RegionMetadataRef;
use store_api::storage::FileId;
use store_api::{MAX_VERSION, MIN_VERSION, ManifestVersion};
use crate::cache::manifest_cache::ManifestCache;
@@ -31,7 +30,7 @@ use crate::error::{
};
use crate::manifest::action::{
RegionChange, RegionCheckpoint, RegionEdit, RegionManifest, RegionManifestBuilder,
RegionMetaAction, RegionMetaActionList,
RegionMetaAction, RegionMetaActionList, RemovedFile,
};
use crate::manifest::checkpointer::Checkpointer;
use crate::manifest::storage::{
@@ -589,7 +588,7 @@ impl RegionManifestManager {
}
/// Clear deleted files from manifest's `removed_files` field without update version. Notice if datanode exit before checkpoint then new manifest by open region may still contain these deleted files, which is acceptable for gc process.
pub fn clear_deleted_files(&mut self, deleted_files: Vec<FileId>) {
pub fn clear_deleted_files(&mut self, deleted_files: Vec<RemovedFile>) {
let mut manifest = (*self.manifest()).clone();
manifest.removed_files.clear_deleted_files(deleted_files);
self.set_manifest(Arc::new(manifest));


@@ -12,43 +12,46 @@
// See the License for the specific language governing permissions and
// limitations under the License.
use std::collections::HashMap;
pub(crate) mod checkpoint;
pub(crate) mod delta;
pub(crate) mod size_tracker;
pub(crate) mod staging;
pub(crate) mod utils;
use std::iter::Iterator;
use std::str::FromStr;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, RwLock};
use std::sync::Arc;
use std::sync::atomic::AtomicU64;
use common_datasource::compression::CompressionType;
use common_telemetry::debug;
use crc32fast::Hasher;
use futures::TryStreamExt;
use futures::future::try_join_all;
use lazy_static::lazy_static;
use object_store::util::join_dir;
use object_store::{Entry, ErrorKind, Lister, ObjectStore, util};
use object_store::{Lister, ObjectStore, util};
use regex::Regex;
use serde::{Deserialize, Serialize};
use snafu::{ResultExt, ensure};
use store_api::ManifestVersion;
use store_api::storage::RegionId;
use tokio::sync::Semaphore;
use crate::cache::manifest_cache::ManifestCache;
use crate::error::{
ChecksumMismatchSnafu, CompressObjectSnafu, DecompressObjectSnafu, InvalidScanIndexSnafu,
OpenDalSnafu, Result, SerdeJsonSnafu, Utf8Snafu,
};
use crate::error::{ChecksumMismatchSnafu, OpenDalSnafu, Result};
use crate::manifest::storage::checkpoint::CheckpointStorage;
use crate::manifest::storage::delta::DeltaStorage;
use crate::manifest::storage::size_tracker::{CheckpointTracker, DeltaTracker, SizeTracker};
use crate::manifest::storage::staging::StagingStorage;
use crate::manifest::storage::utils::remove_from_cache;
lazy_static! {
static ref DELTA_RE: Regex = Regex::new("^\\d+\\.json").unwrap();
static ref CHECKPOINT_RE: Regex = Regex::new("^\\d+\\.checkpoint").unwrap();
}
const LAST_CHECKPOINT_FILE: &str = "_last_checkpoint";
pub const LAST_CHECKPOINT_FILE: &str = "_last_checkpoint";
const DEFAULT_MANIFEST_COMPRESSION_TYPE: CompressionType = CompressionType::Gzip;
/// Due to backward compatibility, it is possible that the user's manifest file has not been compressed.
/// So when we encounter problems, we need to fall back to `FALL_BACK_COMPRESS_TYPE` for processing.
const FALL_BACK_COMPRESS_TYPE: CompressionType = CompressionType::Uncompressed;
pub(crate) const FALL_BACK_COMPRESS_TYPE: CompressionType = CompressionType::Uncompressed;
const FETCH_MANIFEST_PARALLELISM: usize = 16;
/// Returns the directory to the manifest files.
@@ -81,13 +84,13 @@ pub fn gen_path(path: &str, file: &str, compress_type: CompressionType) -> Strin
}
}
fn checkpoint_checksum(data: &[u8]) -> u32 {
pub(crate) fn checkpoint_checksum(data: &[u8]) -> u32 {
let mut hasher = Hasher::new();
hasher.update(data);
hasher.finalize()
}
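checkpoint_checksum and verify_checksum are promoted to pub(crate) so the new storage modules can share them. A minimal sketch of the crc32fast round-trip they are built on (the helper name and payload below are illustrative):

// Compute a CRC32 over a byte slice with crc32fast, as the checksum helpers do.
fn checksum(data: &[u8]) -> u32 {
    let mut hasher = crc32fast::Hasher::new();
    hasher.update(data);
    hasher.finalize()
}

fn main() {
    let payload = b"manifest checkpoint bytes";
    let stored = checksum(payload);
    // On read, recompute and compare against the stored value.
    assert_eq!(stored, checksum(payload));
    // Any corruption changes the checksum, which is what verification catches.
    assert_ne!(stored, checksum(b"corrupted bytes"));
}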
fn verify_checksum(data: &[u8], wanted: Option<u32>) -> Result<()> {
pub(crate) fn verify_checksum(data: &[u8], wanted: Option<u32>) -> Result<()> {
if let Some(checksum) = wanted {
let calculated_checksum = checkpoint_checksum(data);
ensure!(
@@ -127,26 +130,20 @@ pub fn is_checkpoint_file(file_name: &str) -> bool {
CHECKPOINT_RE.is_match(file_name)
}
/// Key to identify a manifest file.
#[derive(Debug, Clone, Copy, Eq, PartialEq, Hash)]
enum FileKey {
/// A delta file (`.json`).
Delta(ManifestVersion),
/// A checkpoint file (`.checkpoint`).
Checkpoint(ManifestVersion),
}
#[derive(Clone, Debug)]
pub struct ManifestObjectStore {
object_store: ObjectStore,
compress_type: CompressionType,
path: String,
staging_path: String,
/// Stores the size of each manifest file.
manifest_size_map: Arc<RwLock<HashMap<FileKey, u64>>>,
total_manifest_size: Arc<AtomicU64>,
/// Optional manifest cache for local caching.
manifest_cache: Option<ManifestCache>,
// Tracks the size of each file in the manifest directory.
size_tracker: SizeTracker,
// The checkpoint file storage.
checkpoint_storage: CheckpointStorage<CheckpointTracker>,
// The delta file storage.
delta_storage: DeltaStorage<DeltaTracker>,
/// The staging file storage.
staging_storage: StagingStorage,
}
impl ManifestObjectStore {
@@ -160,43 +157,37 @@ impl ManifestObjectStore {
common_telemetry::info!("Create manifest store, cache: {}", manifest_cache.is_some());
let path = util::normalize_dir(path);
let staging_path = {
// Convert "region_dir/manifest/" to "region_dir/staging/manifest/"
let parent_dir = path.trim_end_matches("manifest/").trim_end_matches('/');
util::normalize_dir(&format!("{}/staging/manifest", parent_dir))
};
let size_tracker = SizeTracker::new(total_manifest_size);
let checkpoint_tracker = Arc::new(size_tracker.checkpoint_tracker());
let delta_tracker = Arc::new(size_tracker.manifest_tracker());
let checkpoint_storage = CheckpointStorage::new(
path.clone(),
object_store.clone(),
compress_type,
manifest_cache.clone(),
checkpoint_tracker,
);
let delta_storage = DeltaStorage::new(
path.clone(),
object_store.clone(),
compress_type,
manifest_cache.clone(),
delta_tracker,
);
let staging_storage =
StagingStorage::new(path.clone(), object_store.clone(), compress_type);
Self {
object_store,
compress_type,
path,
staging_path,
manifest_size_map: Arc::new(RwLock::new(HashMap::new())),
total_manifest_size,
manifest_cache,
size_tracker,
checkpoint_storage,
delta_storage,
staging_storage,
}
}
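The constructor above wires the checkpoint, delta, and staging storages to one shared size tracker. The tracker's internals are not shown in this diff, so the following is only a hypothetical sketch of the shared-counter idea using std atomics:

use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};

fn main() {
    // One shared counter; each storage holds a clone of the Arc and updates it.
    let total = Arc::new(AtomicU64::new(0));
    let delta_total = Arc::clone(&total);
    let checkpoint_total = Arc::clone(&total);

    delta_total.fetch_add(128, Ordering::Relaxed); // a delta file was written
    checkpoint_total.fetch_add(512, Ordering::Relaxed); // a checkpoint was written

    // total_manifest_size() can then be answered from a single place.
    assert_eq!(total.load(Ordering::Relaxed), 640);
}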
/// Returns the delta file path under the **current** compression algorithm
fn delta_file_path(&self, version: ManifestVersion, is_staging: bool) -> String {
let base_path = if is_staging {
&self.staging_path
} else {
&self.path
};
gen_path(base_path, &delta_file(version), self.compress_type)
}
/// Returns the checkpoint file path under the **current** compression algorithm
fn checkpoint_file_path(&self, version: ManifestVersion) -> String {
gen_path(&self.path, &checkpoint_file(version), self.compress_type)
}
/// Returns the last checkpoint path, because the last checkpoint is not compressed,
/// so its path name has nothing to do with the compression algorithm used by `ManifestObjectStore`
pub(crate) fn last_checkpoint_path(&self) -> String {
format!("{}{}", self.path, LAST_CHECKPOINT_FILE)
}
/// Returns the manifest dir
pub(crate) fn manifest_dir(&self) -> &str {
&self.path
@@ -204,75 +195,14 @@ impl ManifestObjectStore {
/// Returns an iterator of manifests from normal or staging directory.
pub(crate) async fn manifest_lister(&self, is_staging: bool) -> Result<Option<Lister>> {
let path = if is_staging {
&self.staging_path
if is_staging {
self.staging_storage.manifest_lister().await
} else {
&self.path
};
match self.object_store.lister_with(path).await {
Ok(streamer) => Ok(Some(streamer)),
Err(e) if e.kind() == ErrorKind::NotFound => {
debug!("Manifest directory does not exist: {}", path);
Ok(None)
}
Err(e) => Err(e).context(OpenDalSnafu)?,
self.delta_storage.manifest_lister().await
}
}
/// Return all `R`s in the directory that meet the `filter` conditions (that is, the `filter` closure returns `Some(R)`),
/// and discard `R` that does not meet the conditions (that is, the `filter` closure returns `None`)
/// Return an empty vector when directory is not found.
pub async fn get_paths<F, R>(&self, filter: F, is_staging: bool) -> Result<Vec<R>>
where
F: Fn(Entry) -> Option<R>,
{
let Some(streamer) = self.manifest_lister(is_staging).await? else {
return Ok(vec![]);
};
streamer
.try_filter_map(|e| async { Ok(filter(e)) })
.try_collect::<Vec<_>>()
.await
.context(OpenDalSnafu)
}
/// Sorts the manifest files.
fn sort_manifests(entries: &mut [(ManifestVersion, Entry)]) {
entries.sort_unstable_by(|(v1, _), (v2, _)| v1.cmp(v2));
}
/// Scans the manifest files in the range of [start, end) and return all manifest entries.
pub async fn scan(
&self,
start: ManifestVersion,
end: ManifestVersion,
) -> Result<Vec<(ManifestVersion, Entry)>> {
ensure!(start <= end, InvalidScanIndexSnafu { start, end });
let mut entries: Vec<(ManifestVersion, Entry)> = self
.get_paths(
|entry| {
let file_name = entry.name();
if is_delta_file(file_name) {
let version = file_version(file_name);
if start <= version && version < end {
return Some((version, entry));
}
}
None
},
false,
)
.await?;
Self::sort_manifests(&mut entries);
Ok(entries)
}
/// Fetches manifests in range [start_version, end_version).
///
/// This function is guaranteed to return manifests starting strictly from `start_version` (the result must contain `start_version`).
pub async fn fetch_manifests_strict_from(
&self,
@@ -280,70 +210,9 @@ impl ManifestObjectStore {
end_version: ManifestVersion,
region_id: RegionId,
) -> Result<Vec<(ManifestVersion, Vec<u8>)>> {
let mut manifests = self.fetch_manifests(start_version, end_version).await?;
let start_index = manifests.iter().position(|(v, _)| *v == start_version);
debug!(
"Fetches manifests in range [{},{}), start_index: {:?}, region_id: {}, manifests: {:?}",
start_version,
end_version,
start_index,
region_id,
manifests.iter().map(|(v, _)| *v).collect::<Vec<_>>()
);
if let Some(start_index) = start_index {
Ok(manifests.split_off(start_index))
} else {
Ok(vec![])
}
}
/// Common implementation for fetching manifests from entries in parallel.
/// If `is_staging` is true, cache is skipped.
async fn fetch_manifests_from_entries(
&self,
entries: Vec<(ManifestVersion, Entry)>,
is_staging: bool,
) -> Result<Vec<(ManifestVersion, Vec<u8>)>> {
if entries.is_empty() {
return Ok(vec![]);
}
// TODO(weny): Make it configurable.
let semaphore = Semaphore::new(FETCH_MANIFEST_PARALLELISM);
let tasks = entries.iter().map(|(v, entry)| async {
// Safety: semaphore must exist.
let _permit = semaphore.acquire().await.unwrap();
let cache_key = entry.path();
// Try to get from cache first
if let Some(data) = self.get_from_cache(cache_key, is_staging).await {
return Ok((*v, data));
}
// Fetch from remote object store
let compress_type = file_compress_type(entry.name());
let bytes = self
.object_store
.read(entry.path())
.await
.context(OpenDalSnafu)?;
let data = compress_type
.decode(bytes)
.await
.context(DecompressObjectSnafu {
compress_type,
path: entry.path(),
})?;
// Add to cache
self.put_to_cache(cache_key.to_string(), &data, is_staging)
.await;
Ok((*v, data))
});
try_join_all(tasks).await
self.delta_storage
.fetch_manifests_strict_from(start_version, end_version, region_id)
.await
}
/// Fetches all manifests concurrently and returns the manifests in range [start_version, end_version)
@@ -355,8 +224,9 @@ impl ManifestObjectStore {
start_version: ManifestVersion,
end_version: ManifestVersion,
) -> Result<Vec<(ManifestVersion, Vec<u8>)>> {
let manifests = self.scan(start_version, end_version).await?;
self.fetch_manifests_from_entries(manifests, false).await
self.delta_storage
.fetch_manifests(start_version, end_version)
.await
}
/// Deletes manifest files whose version is < end.
@@ -370,20 +240,18 @@ impl ManifestObjectStore {
) -> Result<usize> {
// Stores (entry, is_checkpoint, version) in a Vec.
let entries: Vec<_> = self
.get_paths(
|entry| {
let file_name = entry.name();
let is_checkpoint = is_checkpoint_file(file_name);
if is_delta_file(file_name) || is_checkpoint_file(file_name) {
let version = file_version(file_name);
if version < end {
return Some((entry, is_checkpoint, version));
}
.delta_storage
.get_paths(|entry| {
let file_name = entry.name();
let is_checkpoint = is_checkpoint_file(file_name);
if is_delta_file(file_name) || is_checkpoint_file(file_name) {
let version = file_version(file_name);
if version < end {
return Some((entry, is_checkpoint, version));
}
None
},
false,
)
}
None
})
.await?;
let checkpoint_version = if keep_last_checkpoint {
// Note that the order of entries is unspecific.
@@ -428,7 +296,7 @@ impl ManifestObjectStore {
// Remove from cache first
for (entry, _, _) in &del_entries {
self.remove_from_cache(entry.path()).await;
remove_from_cache(self.manifest_cache.as_ref(), entry.path()).await;
}
self.object_store
@@ -439,9 +307,11 @@ impl ManifestObjectStore {
// delete manifest sizes
for (_, is_checkpoint, version) in &del_entries {
if *is_checkpoint {
self.unset_file_size(&FileKey::Checkpoint(*version));
self.size_tracker
.remove(&size_tracker::FileKey::Checkpoint(*version));
} else {
self.unset_file_size(&FileKey::Delta(*version));
self.size_tracker
.remove(&size_tracker::FileKey::Delta(*version));
}
}
@@ -455,22 +325,11 @@ impl ManifestObjectStore {
bytes: &[u8],
is_staging: bool,
) -> Result<()> {
let path = self.delta_file_path(version, is_staging);
debug!("Save log to manifest storage, version: {}", version);
let data = self
.compress_type
.encode(bytes)
.await
.context(CompressObjectSnafu {
compress_type: self.compress_type,
path: &path,
})?;
let delta_size = data.len();
self.write_and_put_cache(&path, data, is_staging).await?;
self.set_delta_file_size(version, delta_size as u64);
Ok(())
if is_staging {
self.staging_storage.save(version, bytes).await
} else {
self.delta_storage.save(version, bytes).await
}
}
/// Save the checkpoint manifest file.
@@ -479,155 +338,50 @@ impl ManifestObjectStore {
version: ManifestVersion,
bytes: &[u8],
) -> Result<()> {
let path = self.checkpoint_file_path(version);
let data = self
.compress_type
.encode(bytes)
self.checkpoint_storage
.save_checkpoint(version, bytes)
.await
.context(CompressObjectSnafu {
compress_type: self.compress_type,
path: &path,
})?;
let checkpoint_size = data.len();
let checksum = checkpoint_checksum(bytes);
self.write_and_put_cache(&path, data, false).await?;
self.set_checkpoint_file_size(version, checkpoint_size as u64);
// Because last checkpoint file only contain size and version, which is tiny, so we don't compress it.
let last_checkpoint_path = self.last_checkpoint_path();
let checkpoint_metadata = CheckpointMetadata {
size: bytes.len(),
version,
checksum: Some(checksum),
extend_metadata: HashMap::new(),
};
debug!(
"Save checkpoint in path: {}, metadata: {:?}",
last_checkpoint_path, checkpoint_metadata
);
let bytes = checkpoint_metadata.encode()?;
self.object_store
.write(&last_checkpoint_path, bytes)
.await
.context(OpenDalSnafu)?;
Ok(())
}
async fn load_checkpoint(
&mut self,
metadata: CheckpointMetadata,
) -> Result<Option<(ManifestVersion, Vec<u8>)>> {
let version = metadata.version;
let path = self.checkpoint_file_path(version);
// Try to get from cache first
if let Some(data) = self.get_from_cache(&path, false).await {
verify_checksum(&data, metadata.checksum)?;
return Ok(Some((version, data)));
}
// For backward compatibility, the user's checkpoint may be uncompressed,
// so if the file is not found under the current compress type, fall back to the uncompressed checkpoint and try again.
let checkpoint_data = match self.object_store.read(&path).await {
Ok(checkpoint) => {
let checkpoint_size = checkpoint.len();
let decompress_data =
self.compress_type
.decode(checkpoint)
.await
.with_context(|_| DecompressObjectSnafu {
compress_type: self.compress_type,
path: path.clone(),
})?;
verify_checksum(&decompress_data, metadata.checksum)?;
// set the checkpoint size
self.set_checkpoint_file_size(version, checkpoint_size as u64);
// Add to cache
self.put_to_cache(path, &decompress_data, false).await;
Ok(Some(decompress_data))
}
Err(e) => {
if e.kind() == ErrorKind::NotFound {
if self.compress_type != FALL_BACK_COMPRESS_TYPE {
let fall_back_path = gen_path(
&self.path,
&checkpoint_file(version),
FALL_BACK_COMPRESS_TYPE,
);
debug!(
"Failed to load checkpoint from path: {}, fall back to path: {}",
path, fall_back_path
);
// Try to get fallback from cache first
if let Some(data) = self.get_from_cache(&fall_back_path, false).await {
verify_checksum(&data, metadata.checksum)?;
return Ok(Some((version, data)));
}
match self.object_store.read(&fall_back_path).await {
Ok(checkpoint) => {
let checkpoint_size = checkpoint.len();
let decompress_data = FALL_BACK_COMPRESS_TYPE
.decode(checkpoint)
.await
.with_context(|_| DecompressObjectSnafu {
compress_type: FALL_BACK_COMPRESS_TYPE,
path: fall_back_path.clone(),
})?;
verify_checksum(&decompress_data, metadata.checksum)?;
self.set_checkpoint_file_size(version, checkpoint_size as u64);
// Add fallback to cache
self.put_to_cache(fall_back_path, &decompress_data, false)
.await;
Ok(Some(decompress_data))
}
Err(e) if e.kind() == ErrorKind::NotFound => Ok(None),
Err(e) => Err(e).context(OpenDalSnafu),
}
} else {
Ok(None)
}
} else {
Err(e).context(OpenDalSnafu)
}
}
}?;
Ok(checkpoint_data.map(|data| (version, data)))
}
/// Load the latest checkpoint.
/// Returns the manifest version and the raw [RegionCheckpoint](crate::manifest::action::RegionCheckpoint) content, if any.
pub async fn load_last_checkpoint(&mut self) -> Result<Option<(ManifestVersion, Vec<u8>)>> {
let last_checkpoint_path = self.last_checkpoint_path();
// Fetch from remote object store without cache
let last_checkpoint_data = match self.object_store.read(&last_checkpoint_path).await {
Ok(data) => data.to_vec(),
Err(e) if e.kind() == ErrorKind::NotFound => {
return Ok(None);
}
Err(e) => {
return Err(e).context(OpenDalSnafu)?;
}
};
let checkpoint_metadata = CheckpointMetadata::decode(&last_checkpoint_data)?;
debug!(
"Load checkpoint in path: {}, metadata: {:?}",
last_checkpoint_path, checkpoint_metadata
);
self.load_checkpoint(checkpoint_metadata).await
self.checkpoint_storage.load_last_checkpoint().await
}
#[cfg(test)]
/// Computes the total size (in bytes) of the tracked manifest files.
pub(crate) fn total_manifest_size(&self) -> u64 {
self.size_tracker.total()
}
/// Resets the size of all files.
pub(crate) fn reset_manifest_size(&mut self) {
self.size_tracker.reset();
}
/// Set the size of the delta file by delta version.
pub(crate) fn set_delta_file_size(&mut self, version: ManifestVersion, size: u64) {
self.size_tracker.record_delta(version, size);
}
/// Set the size of the checkpoint file by checkpoint version.
pub(crate) fn set_checkpoint_file_size(&self, version: ManifestVersion, size: u64) {
self.size_tracker.record_checkpoint(version, size);
}
/// Fetch all staging manifest files and return them as (version, action_list) pairs.
pub async fn fetch_staging_manifests(&self) -> Result<Vec<(ManifestVersion, Vec<u8>)>> {
self.staging_storage.fetch_manifests().await
}
/// Clear all staging manifest files.
pub async fn clear_staging_manifests(&mut self) -> Result<()> {
self.staging_storage.clear().await
}
}
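After the refactor, the facade above mostly delegates to its three storage components. The following is a minimal caller-side sketch, not part of this diff, assuming the crate-internal test helper new_test_manifest_store() defined in the tests below; the version numbers and payloads are made up and error handling is elided.
async fn facade_sketch() -> Result<()> {
    let mut store = new_test_manifest_store();
    // `save` routes to `staging_storage` when `is_staging` is true, otherwise to `delta_storage`.
    store.save(1, b"delta action list", false).await?;
    store.save(2, b"staged action list", true).await?;
    // Checkpoints go through `checkpoint_storage`, which also rewrites the last-checkpoint marker.
    store.save_checkpoint(1, b"checkpoint body").await?;
    let _latest = store.load_last_checkpoint().await?;
    // Staging files can be listed and cleared independently of committed deltas.
    let _staged = store.fetch_staging_manifests().await?;
    store.clear_staging_manifests().await?;
    Ok(())
}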
#[cfg(test)]
impl ManifestObjectStore {
pub async fn read_file(&self, path: &str) -> Result<Vec<u8>> {
self.object_store
.read(path)
@@ -636,214 +390,18 @@ impl ManifestObjectStore {
.map(|v| v.to_vec())
}
#[cfg(test)]
pub async fn write_last_checkpoint(
&mut self,
version: ManifestVersion,
bytes: &[u8],
) -> Result<()> {
let path = self.checkpoint_file_path(version);
let data = self
.compress_type
.encode(bytes)
.await
.context(CompressObjectSnafu {
compress_type: self.compress_type,
path: &path,
})?;
let checkpoint_size = data.len();
self.object_store
.write(&path, data)
.await
.context(OpenDalSnafu)?;
self.set_checkpoint_file_size(version, checkpoint_size as u64);
let last_checkpoint_path = self.last_checkpoint_path();
let checkpoint_metadata = CheckpointMetadata {
size: bytes.len(),
version,
checksum: Some(1218259706),
extend_metadata: HashMap::new(),
};
debug!(
"Rewrite checkpoint in path: {}, metadata: {:?}",
last_checkpoint_path, checkpoint_metadata
);
let bytes = checkpoint_metadata.encode()?;
// Overwrite the last checkpoint with the modified content
self.object_store
.write(&last_checkpoint_path, bytes.clone())
.await
.context(OpenDalSnafu)?;
Ok(())
pub(crate) fn checkpoint_storage(&self) -> &CheckpointStorage<CheckpointTracker> {
&self.checkpoint_storage
}
/// Computes the total size (in bytes) of the tracked manifest files.
pub(crate) fn total_manifest_size(&self) -> u64 {
self.manifest_size_map.read().unwrap().values().sum()
pub(crate) fn delta_storage(&self) -> &DeltaStorage<DeltaTracker> {
&self.delta_storage
}
/// Resets the size of all files.
pub(crate) fn reset_manifest_size(&mut self) {
self.manifest_size_map.write().unwrap().clear();
self.total_manifest_size.store(0, Ordering::Relaxed);
}
/// Set the size of the delta file by delta version.
pub(crate) fn set_delta_file_size(&mut self, version: ManifestVersion, size: u64) {
let mut m = self.manifest_size_map.write().unwrap();
m.insert(FileKey::Delta(version), size);
self.inc_total_manifest_size(size);
}
/// Set the size of the checkpoint file by checkpoint version.
pub(crate) fn set_checkpoint_file_size(&self, version: ManifestVersion, size: u64) {
let mut m = self.manifest_size_map.write().unwrap();
m.insert(FileKey::Checkpoint(version), size);
self.inc_total_manifest_size(size);
}
fn unset_file_size(&self, key: &FileKey) {
let mut m = self.manifest_size_map.write().unwrap();
if let Some(val) = m.remove(key) {
debug!("Unset file size: {:?}, size: {}", key, val);
self.dec_total_manifest_size(val);
}
}
fn inc_total_manifest_size(&self, val: u64) {
self.total_manifest_size.fetch_add(val, Ordering::Relaxed);
}
fn dec_total_manifest_size(&self, val: u64) {
self.total_manifest_size.fetch_sub(val, Ordering::Relaxed);
}
/// Fetch all staging manifest files and return them as (version, action_list) pairs.
pub async fn fetch_staging_manifests(&self) -> Result<Vec<(ManifestVersion, Vec<u8>)>> {
let manifest_entries = self
.get_paths(
|entry| {
let file_name = entry.name();
if is_delta_file(file_name) {
let version = file_version(file_name);
Some((version, entry))
} else {
None
}
},
true,
)
.await?;
let mut sorted_entries = manifest_entries;
Self::sort_manifests(&mut sorted_entries);
self.fetch_manifests_from_entries(sorted_entries, true)
.await
}
/// Clear all staging manifest files.
pub async fn clear_staging_manifests(&mut self) -> Result<()> {
self.object_store
.remove_all(&self.staging_path)
.await
.context(OpenDalSnafu)?;
debug!(
"Cleared all staging manifest files from {}",
self.staging_path
);
Ok(())
}
/// Gets a manifest file from cache.
/// Returns the file data if found in cache, None otherwise.
/// If `is_staging` is true, always returns None.
async fn get_from_cache(&self, key: &str, is_staging: bool) -> Option<Vec<u8>> {
if is_staging {
return None;
}
let cache = self.manifest_cache.as_ref()?;
cache.get_file(key).await
}
/// Puts a manifest file into cache.
/// If `is_staging` is true, does nothing.
async fn put_to_cache(&self, key: String, data: &[u8], is_staging: bool) {
if is_staging {
return;
}
let Some(cache) = &self.manifest_cache else {
return;
};
cache.put_file(key, data.to_vec()).await;
}
/// Writes data to object store and puts it into cache.
/// If `is_staging` is true, cache is skipped.
async fn write_and_put_cache(&self, path: &str, data: Vec<u8>, is_staging: bool) -> Result<()> {
// Clone data for cache before writing, only if cache is enabled and not staging
let cache_data = if !is_staging && self.manifest_cache.is_some() {
Some(data.clone())
} else {
None
};
// Write to object store
self.object_store
.write(path, data)
.await
.context(OpenDalSnafu)?;
// Put to cache if we cloned the data
if let Some(data) = cache_data {
self.put_to_cache(path.to_string(), &data, is_staging).await;
}
Ok(())
}
/// Removes a manifest file from cache.
async fn remove_from_cache(&self, key: &str) {
let Some(cache) = &self.manifest_cache else {
return;
};
cache.remove(key).await;
}
}
#[derive(Serialize, Deserialize, Debug)]
pub(crate) struct CheckpointMetadata {
pub size: usize,
/// The latest version this checkpoint contains.
pub version: ManifestVersion,
pub checksum: Option<u32>,
pub extend_metadata: HashMap<String, String>,
}
impl CheckpointMetadata {
fn encode(&self) -> Result<Vec<u8>> {
Ok(serde_json::to_string(self)
.context(SerdeJsonSnafu)?
.into_bytes())
}
fn decode(bs: &[u8]) -> Result<Self> {
let data = std::str::from_utf8(bs).context(Utf8Snafu)?;
serde_json::from_str(data).context(SerdeJsonSnafu)
pub(crate) fn set_compress_type(&mut self, compress_type: CompressionType) {
self.checkpoint_storage.set_compress_type(compress_type);
self.delta_storage.set_compress_type(compress_type);
self.staging_storage.set_compress_type(compress_type);
}
}
@@ -854,6 +412,7 @@ mod tests {
use object_store::services::Fs;
use super::*;
use crate::manifest::storage::checkpoint::CheckpointMetadata;
fn new_test_manifest_store() -> ManifestObjectStore {
common_telemetry::init_default_ut_logging();
@@ -890,14 +449,14 @@ mod tests {
#[tokio::test]
async fn test_manifest_log_store_uncompress() {
let mut log_store = new_test_manifest_store();
log_store.compress_type = CompressionType::Uncompressed;
log_store.set_compress_type(CompressionType::Uncompressed);
test_manifest_log_store_case(log_store).await;
}
#[tokio::test]
async fn test_manifest_log_store_compress() {
let mut log_store = new_test_manifest_store();
log_store.compress_type = CompressionType::Gzip;
log_store.set_compress_type(CompressionType::Gzip);
test_manifest_log_store_case(log_store).await;
}
@@ -941,6 +500,7 @@ mod tests {
// delete logs with version < 4 and keep checkpoint 3.
let _ = log_store.delete_until(4, true).await.unwrap();
let _ = log_store
.checkpoint_storage
.load_checkpoint(new_checkpoint_metadata_with_version(3))
.await
.unwrap()
@@ -958,6 +518,7 @@ mod tests {
let _ = log_store.delete_until(11, false).await.unwrap();
assert!(
log_store
.checkpoint_storage
.load_checkpoint(new_checkpoint_metadata_with_version(3))
.await
.unwrap()
@@ -976,7 +537,7 @@ mod tests {
let mut log_store = new_test_manifest_store();
// write uncompressed data to simulate previously uncompressed data
log_store.compress_type = CompressionType::Uncompressed;
log_store.set_compress_type(CompressionType::Uncompressed);
for v in 0..5 {
log_store
.save(v, format!("hello, {v}").as_bytes(), false)
@@ -989,7 +550,7 @@ mod tests {
.unwrap();
// change compress type
log_store.compress_type = CompressionType::Gzip;
log_store.set_compress_type(CompressionType::Gzip);
// test that load_last_checkpoint works correctly for previously uncompressed data
let (v, checkpoint) = log_store.load_last_checkpoint().await.unwrap().unwrap();
@@ -1018,6 +579,7 @@ mod tests {
assert_eq!(format!("hello, {v}").as_bytes(), bytes);
}
let (v, checkpoint) = log_store
.checkpoint_storage
.load_checkpoint(new_checkpoint_metadata_with_version(5))
.await
.unwrap()
@@ -1052,7 +614,7 @@ mod tests {
async fn test_uncompressed_manifest_files_size() {
let mut log_store = new_test_manifest_store();
// write 5 uncompressed manifest files, 8 bytes per file
log_store.compress_type = CompressionType::Uncompressed;
log_store.set_compress_type(CompressionType::Uncompressed);
for v in 0..5 {
log_store
.save(v, format!("hello, {v}").as_bytes(), false)
@@ -1090,7 +652,7 @@ mod tests {
async fn test_compressed_manifest_files_size() {
let mut log_store = new_test_manifest_store();
// Test with compressed manifest files
log_store.compress_type = CompressionType::Gzip;
log_store.set_compress_type(CompressionType::Gzip);
// write 5 manifest files
for v in 0..5 {
log_store


@@ -0,0 +1,316 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
use std::collections::HashMap;
use std::sync::Arc;
use common_datasource::compression::CompressionType;
use common_telemetry::debug;
use object_store::{ErrorKind, ObjectStore};
use serde::{Deserialize, Serialize};
use snafu::ResultExt;
use store_api::ManifestVersion;
use crate::cache::manifest_cache::ManifestCache;
use crate::error::{
CompressObjectSnafu, DecompressObjectSnafu, OpenDalSnafu, Result, SerdeJsonSnafu, Utf8Snafu,
};
use crate::manifest::storage::size_tracker::Tracker;
use crate::manifest::storage::utils::{get_from_cache, put_to_cache, write_and_put_cache};
use crate::manifest::storage::{
FALL_BACK_COMPRESS_TYPE, LAST_CHECKPOINT_FILE, checkpoint_checksum, checkpoint_file, gen_path,
verify_checksum,
};
#[derive(Serialize, Deserialize, Debug)]
pub(crate) struct CheckpointMetadata {
pub size: usize,
/// The latest version this checkpoint contains.
pub version: ManifestVersion,
pub checksum: Option<u32>,
pub extend_metadata: HashMap<String, String>,
}
impl CheckpointMetadata {
fn encode(&self) -> Result<Vec<u8>> {
Ok(serde_json::to_string(self)
.context(SerdeJsonSnafu)?
.into_bytes())
}
fn decode(bs: &[u8]) -> Result<Self> {
let data = std::str::from_utf8(bs).context(Utf8Snafu)?;
serde_json::from_str(data).context(SerdeJsonSnafu)
}
}
/// Handles checkpoint storage operations.
#[derive(Debug, Clone)]
pub(crate) struct CheckpointStorage<T: Tracker> {
object_store: ObjectStore,
compress_type: CompressionType,
path: String,
manifest_cache: Option<ManifestCache>,
size_tracker: Arc<T>,
}
impl<T: Tracker> CheckpointStorage<T> {
pub fn new(
path: String,
object_store: ObjectStore,
compress_type: CompressionType,
manifest_cache: Option<ManifestCache>,
size_tracker: Arc<T>,
) -> Self {
Self {
object_store,
compress_type,
path,
manifest_cache,
size_tracker,
}
}
/// Returns the last checkpoint path. Because the last checkpoint is not compressed,
/// its file name is independent of the compression algorithm used by `ManifestObjectStore`.
pub(crate) fn last_checkpoint_path(&self) -> String {
format!("{}{}", self.path, LAST_CHECKPOINT_FILE)
}
/// Returns the checkpoint file path under the **current** compression algorithm
fn checkpoint_file_path(&self, version: ManifestVersion) -> String {
gen_path(&self.path, &checkpoint_file(version), self.compress_type)
}
pub(crate) async fn load_checkpoint(
&mut self,
metadata: CheckpointMetadata,
) -> Result<Option<(ManifestVersion, Vec<u8>)>> {
let version = metadata.version;
let path = self.checkpoint_file_path(version);
// Try to get from cache first
if let Some(data) = get_from_cache(self.manifest_cache.as_ref(), &path).await {
verify_checksum(&data, metadata.checksum)?;
return Ok(Some((version, data)));
}
// For backward compatibility, the user's checkpoint may be uncompressed,
// so if the file is not found under the current compress type, fall back to the uncompressed checkpoint and try again.
let checkpoint_data = match self.object_store.read(&path).await {
Ok(checkpoint) => {
let checkpoint_size = checkpoint.len();
let decompress_data =
self.compress_type
.decode(checkpoint)
.await
.with_context(|_| DecompressObjectSnafu {
compress_type: self.compress_type,
path: path.clone(),
})?;
verify_checksum(&decompress_data, metadata.checksum)?;
// set the checkpoint size
self.size_tracker.record(version, checkpoint_size as u64);
// Add to cache
put_to_cache(self.manifest_cache.as_ref(), path, &decompress_data).await;
Ok(Some(decompress_data))
}
Err(e) => {
if e.kind() == ErrorKind::NotFound {
if self.compress_type != FALL_BACK_COMPRESS_TYPE {
let fall_back_path = gen_path(
&self.path,
&checkpoint_file(version),
FALL_BACK_COMPRESS_TYPE,
);
debug!(
"Failed to load checkpoint from path: {}, fall back to path: {}",
path, fall_back_path
);
// Try to get fallback from cache first
if let Some(data) =
get_from_cache(self.manifest_cache.as_ref(), &fall_back_path).await
{
verify_checksum(&data, metadata.checksum)?;
return Ok(Some((version, data)));
}
match self.object_store.read(&fall_back_path).await {
Ok(checkpoint) => {
let checkpoint_size = checkpoint.len();
let decompress_data = FALL_BACK_COMPRESS_TYPE
.decode(checkpoint)
.await
.with_context(|_| DecompressObjectSnafu {
compress_type: FALL_BACK_COMPRESS_TYPE,
path: fall_back_path.clone(),
})?;
verify_checksum(&decompress_data, metadata.checksum)?;
self.size_tracker.record(version, checkpoint_size as u64);
// Add fallback to cache
put_to_cache(
self.manifest_cache.as_ref(),
fall_back_path,
&decompress_data,
)
.await;
Ok(Some(decompress_data))
}
Err(e) if e.kind() == ErrorKind::NotFound => Ok(None),
Err(e) => return Err(e).context(OpenDalSnafu),
}
} else {
Ok(None)
}
} else {
Err(e).context(OpenDalSnafu)
}
}
}?;
Ok(checkpoint_data.map(|data| (version, data)))
}
/// Load the latest checkpoint.
/// Returns the manifest version and the raw [RegionCheckpoint](crate::manifest::action::RegionCheckpoint) content, if any.
pub async fn load_last_checkpoint(&mut self) -> Result<Option<(ManifestVersion, Vec<u8>)>> {
let last_checkpoint_path = self.last_checkpoint_path();
// Fetch from remote object store without cache
let last_checkpoint_data = match self.object_store.read(&last_checkpoint_path).await {
Ok(data) => data.to_vec(),
Err(e) if e.kind() == ErrorKind::NotFound => {
return Ok(None);
}
Err(e) => {
return Err(e).context(OpenDalSnafu)?;
}
};
let checkpoint_metadata = CheckpointMetadata::decode(&last_checkpoint_data)?;
debug!(
"Load checkpoint in path: {}, metadata: {:?}",
last_checkpoint_path, checkpoint_metadata
);
self.load_checkpoint(checkpoint_metadata).await
}
/// Save the checkpoint manifest file.
pub(crate) async fn save_checkpoint(
&self,
version: ManifestVersion,
bytes: &[u8],
) -> Result<()> {
let path = self.checkpoint_file_path(version);
let data = self
.compress_type
.encode(bytes)
.await
.context(CompressObjectSnafu {
compress_type: self.compress_type,
path: &path,
})?;
let checkpoint_size = data.len();
let checksum = checkpoint_checksum(bytes);
write_and_put_cache(
&self.object_store,
self.manifest_cache.as_ref(),
&path,
data,
)
.await?;
self.size_tracker.record(version, checkpoint_size as u64);
// The last checkpoint file only contains the size and version, which is tiny, so we don't compress it.
let last_checkpoint_path = self.last_checkpoint_path();
let checkpoint_metadata = CheckpointMetadata {
size: bytes.len(),
version,
checksum: Some(checksum),
extend_metadata: HashMap::new(),
};
debug!(
"Save checkpoint in path: {}, metadata: {:?}",
last_checkpoint_path, checkpoint_metadata
);
let bytes = checkpoint_metadata.encode()?;
self.object_store
.write(&last_checkpoint_path, bytes)
.await
.context(OpenDalSnafu)?;
Ok(())
}
}
#[cfg(test)]
impl<T: Tracker> CheckpointStorage<T> {
pub async fn write_last_checkpoint(
&self,
version: ManifestVersion,
bytes: &[u8],
) -> Result<()> {
let path = self.checkpoint_file_path(version);
let data = self
.compress_type
.encode(bytes)
.await
.context(CompressObjectSnafu {
compress_type: self.compress_type,
path: &path,
})?;
let checkpoint_size = data.len();
self.object_store
.write(&path, data)
.await
.context(OpenDalSnafu)?;
self.size_tracker.record(version, checkpoint_size as u64);
let last_checkpoint_path = self.last_checkpoint_path();
let checkpoint_metadata = CheckpointMetadata {
size: bytes.len(),
version,
checksum: Some(1218259706),
extend_metadata: HashMap::new(),
};
debug!(
"Rewrite checkpoint in path: {}, metadata: {:?}",
last_checkpoint_path, checkpoint_metadata
);
let bytes = checkpoint_metadata.encode()?;
// Overwrite the last checkpoint with the modified content
self.object_store
.write(&last_checkpoint_path, bytes.clone())
.await
.context(OpenDalSnafu)?;
Ok(())
}
pub fn set_compress_type(&mut self, compress_type: CompressionType) {
self.compress_type = compress_type;
}
}
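The last-checkpoint marker written by save_checkpoint above is plain, uncompressed serde_json of CheckpointMetadata. A small round-trip sketch, not part of this diff, assuming it lives inside this module so the private encode/decode helpers are visible; the field values are made up.
fn metadata_sketch() -> Result<()> {
    let meta = CheckpointMetadata {
        size: 1024,
        version: 42,
        checksum: Some(1218259706),
        extend_metadata: HashMap::new(),
    };
    // Serialized form is roughly: {"size":1024,"version":42,"checksum":1218259706,"extend_metadata":{}}
    let bytes = meta.encode()?;
    let decoded = CheckpointMetadata::decode(&bytes)?;
    assert_eq!(decoded.version, 42);
    Ok(())
}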


@@ -0,0 +1,251 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
use std::sync::Arc;
use common_datasource::compression::CompressionType;
use common_telemetry::debug;
use futures::TryStreamExt;
use futures::future::try_join_all;
use object_store::{Entry, ErrorKind, Lister, ObjectStore};
use snafu::{ResultExt, ensure};
use store_api::ManifestVersion;
use store_api::storage::RegionId;
use tokio::sync::Semaphore;
use crate::cache::manifest_cache::ManifestCache;
use crate::error::{
CompressObjectSnafu, DecompressObjectSnafu, InvalidScanIndexSnafu, OpenDalSnafu, Result,
};
use crate::manifest::storage::size_tracker::Tracker;
use crate::manifest::storage::utils::{
get_from_cache, put_to_cache, sort_manifests, write_and_put_cache,
};
use crate::manifest::storage::{
FETCH_MANIFEST_PARALLELISM, delta_file, file_compress_type, file_version, gen_path,
is_delta_file,
};
/// Handles delta manifest storage operations.
#[derive(Debug, Clone)]
pub(crate) struct DeltaStorage<T: Tracker> {
object_store: ObjectStore,
compress_type: CompressionType,
path: String,
delta_tracker: Arc<T>,
manifest_cache: Option<ManifestCache>,
}
impl<T: Tracker> DeltaStorage<T> {
pub(crate) fn new(
path: String,
object_store: ObjectStore,
compress_type: CompressionType,
manifest_cache: Option<ManifestCache>,
delta_tracker: Arc<T>,
) -> Self {
Self {
object_store,
compress_type,
path,
delta_tracker,
manifest_cache,
}
}
pub(crate) fn path(&self) -> &str {
&self.path
}
pub(crate) fn object_store(&self) -> &ObjectStore {
&self.object_store
}
fn delta_file_path(&self, version: ManifestVersion) -> String {
gen_path(&self.path, &delta_file(version), self.compress_type)
}
/// Returns a lister over the manifest files in the path directory, or `None` if the directory does not exist.
pub(crate) async fn manifest_lister(&self) -> Result<Option<Lister>> {
match self.object_store.lister_with(&self.path).await {
Ok(streamer) => Ok(Some(streamer)),
Err(e) if e.kind() == ErrorKind::NotFound => {
debug!("Manifest directory does not exist: {}", self.path);
Ok(None)
}
Err(e) => Err(e).context(OpenDalSnafu)?,
}
}
/// Returns all `R`s in the directory for which the `filter` closure returns `Some(R)`,
/// discarding entries for which it returns `None`.
/// Returns an empty vector when the directory is not found.
pub async fn get_paths<F, R>(&self, filter: F) -> Result<Vec<R>>
where
F: Fn(Entry) -> Option<R>,
{
let Some(streamer) = self.manifest_lister().await? else {
return Ok(vec![]);
};
streamer
.try_filter_map(|e| async { Ok(filter(e)) })
.try_collect::<Vec<_>>()
.await
.context(OpenDalSnafu)
}
/// Scans the manifest files in the range [start, end) and returns all matching manifest entries.
pub async fn scan(
&self,
start: ManifestVersion,
end: ManifestVersion,
) -> Result<Vec<(ManifestVersion, Entry)>> {
ensure!(start <= end, InvalidScanIndexSnafu { start, end });
let mut entries: Vec<(ManifestVersion, Entry)> = self
.get_paths(|entry| {
let file_name = entry.name();
if is_delta_file(file_name) {
let version = file_version(file_name);
if start <= version && version < end {
return Some((version, entry));
}
}
None
})
.await?;
sort_manifests(&mut entries);
Ok(entries)
}
/// Fetches manifests in range [start_version, end_version).
///
/// This function is guaranteed to return manifests strictly starting from `start_version` (the result must contain `start_version`); if `start_version` is not found, it returns an empty vector.
pub async fn fetch_manifests_strict_from(
&self,
start_version: ManifestVersion,
end_version: ManifestVersion,
region_id: RegionId,
) -> Result<Vec<(ManifestVersion, Vec<u8>)>> {
let mut manifests = self.fetch_manifests(start_version, end_version).await?;
let start_index = manifests.iter().position(|(v, _)| *v == start_version);
debug!(
"Fetches manifests in range [{},{}), start_index: {:?}, region_id: {}, manifests: {:?}",
start_version,
end_version,
start_index,
region_id,
manifests.iter().map(|(v, _)| *v).collect::<Vec<_>>()
);
if let Some(start_index) = start_index {
Ok(manifests.split_off(start_index))
} else {
Ok(vec![])
}
}
/// Common implementation for fetching manifests from entries in parallel.
pub(crate) async fn fetch_manifests_from_entries(
&self,
entries: Vec<(ManifestVersion, Entry)>,
) -> Result<Vec<(ManifestVersion, Vec<u8>)>> {
if entries.is_empty() {
return Ok(vec![]);
}
// TODO(weny): Make it configurable.
let semaphore = Semaphore::new(FETCH_MANIFEST_PARALLELISM);
let tasks = entries.iter().map(|(v, entry)| async {
// Safety: semaphore must exist.
let _permit = semaphore.acquire().await.unwrap();
let cache_key = entry.path();
// Try to get from cache first
if let Some(data) = get_from_cache(self.manifest_cache.as_ref(), cache_key).await {
return Ok((*v, data));
}
// Fetch from remote object store
let compress_type = file_compress_type(entry.name());
let bytes = self
.object_store
.read(entry.path())
.await
.context(OpenDalSnafu)?;
let data = compress_type
.decode(bytes)
.await
.context(DecompressObjectSnafu {
compress_type,
path: entry.path(),
})?;
// Add to cache
put_to_cache(self.manifest_cache.as_ref(), cache_key.to_string(), &data).await;
Ok((*v, data))
});
try_join_all(tasks).await
}
/// Fetches all manifests concurrently and returns the manifests in range [start_version, end_version).
///
/// **Note**: This function does not guarantee that the result starts strictly from `start_version`.
/// Use [fetch_manifests_strict_from](DeltaStorage::fetch_manifests_strict_from) to fetch manifests strictly from `start_version`.
pub async fn fetch_manifests(
&self,
start_version: ManifestVersion,
end_version: ManifestVersion,
) -> Result<Vec<(ManifestVersion, Vec<u8>)>> {
let manifests = self.scan(start_version, end_version).await?;
self.fetch_manifests_from_entries(manifests).await
}
/// Save the delta manifest file.
pub async fn save(&mut self, version: ManifestVersion, bytes: &[u8]) -> Result<()> {
let path = self.delta_file_path(version);
debug!("Save log to manifest storage, version: {}", version);
let data = self
.compress_type
.encode(bytes)
.await
.context(CompressObjectSnafu {
compress_type: self.compress_type,
path: &path,
})?;
let delta_size = data.len();
write_and_put_cache(
&self.object_store,
self.manifest_cache.as_ref(),
&path,
data,
)
.await?;
self.delta_tracker.record(version, delta_size as u64);
Ok(())
}
}
#[cfg(test)]
impl<T: Tracker> DeltaStorage<T> {
pub fn set_compress_type(&mut self, compress_type: CompressionType) {
self.compress_type = compress_type;
}
}
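To illustrate the difference between fetch_manifests and fetch_manifests_strict_from above, a small caller-side sketch, not part of this diff; the DeltaStorage instance, the region id, the NoopTracker from the size tracker module, and the on-disk versions 3, 4, and 5 are all hypothetical.
async fn fetch_semantics_sketch(
    delta: &DeltaStorage<NoopTracker>,
    region_id: RegionId,
) -> Result<()> {
    // Suppose only delta versions 3, 4, and 5 exist under the manifest path.
    let relaxed = delta.fetch_manifests(2, 6).await?;
    assert_eq!(relaxed.iter().map(|(v, _)| *v).collect::<Vec<_>>(), vec![3, 4, 5]);
    // The strict variant requires the result to contain `start_version` itself,
    // so a missing version 2 yields an empty result instead of [3, 4, 5].
    let strict = delta.fetch_manifests_strict_from(2, 6, region_id).await?;
    assert!(strict.is_empty());
    let strict = delta.fetch_manifests_strict_from(3, 6, region_id).await?;
    assert_eq!(strict.first().map(|(v, _)| *v), Some(3));
    Ok(())
}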


@@ -0,0 +1,130 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
use std::collections::HashMap;
use std::fmt::Debug;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, RwLock};
use store_api::ManifestVersion;
/// Key to identify a manifest file.
#[derive(Debug, Clone, Copy, Eq, PartialEq, Hash)]
pub(crate) enum FileKey {
/// A delta file (`.json`).
Delta(ManifestVersion),
/// A checkpoint file (`.checkpoint`).
Checkpoint(ManifestVersion),
}
/// Records manifest file sizes on behalf of a storage component.
pub(crate) trait Tracker: Send + Sync + Debug {
fn record(&self, version: ManifestVersion, size: u64);
}
#[derive(Debug, Clone)]
pub struct CheckpointTracker {
size_tracker: SizeTracker,
}
impl Tracker for CheckpointTracker {
fn record(&self, version: ManifestVersion, size: u64) {
self.size_tracker.record(FileKey::Checkpoint(version), size);
}
}
#[derive(Debug, Clone)]
pub struct DeltaTracker {
size_tracker: SizeTracker,
}
impl Tracker for DeltaTracker {
fn record(&self, version: ManifestVersion, size: u64) {
self.size_tracker.record(FileKey::Delta(version), size);
}
}
#[derive(Debug, Clone)]
pub struct NoopTracker;
impl Tracker for NoopTracker {
fn record(&self, _version: ManifestVersion, _size: u64) {
// noop
}
}
/// Tracks per-file manifest sizes and maintains the aggregated total size.
#[derive(Debug, Clone, Default)]
pub(crate) struct SizeTracker {
file_sizes: Arc<RwLock<HashMap<FileKey, u64>>>,
total_size: Arc<AtomicU64>,
}
impl SizeTracker {
/// Returns a new [SizeTracker].
pub fn new(total_size: Arc<AtomicU64>) -> Self {
Self {
file_sizes: Arc::new(RwLock::new(HashMap::new())),
total_size,
}
}
/// Returns a tracker that records delta file sizes.
pub(crate) fn manifest_tracker(&self) -> DeltaTracker {
DeltaTracker {
size_tracker: self.clone(),
}
}
/// Returns the checkpoint tracker.
pub(crate) fn checkpoint_tracker(&self) -> CheckpointTracker {
CheckpointTracker {
size_tracker: self.clone(),
}
}
/// Records a delta file size.
pub(crate) fn record_delta(&self, version: ManifestVersion, size: u64) {
self.record(FileKey::Delta(version), size);
}
/// Records a checkpoint file size.
pub(crate) fn record_checkpoint(&self, version: ManifestVersion, size: u64) {
self.record(FileKey::Checkpoint(version), size);
}
/// Removes a file from tracking.
pub(crate) fn remove(&self, key: &FileKey) {
if let Some(size) = self.file_sizes.write().unwrap().remove(key) {
self.total_size.fetch_sub(size, Ordering::Relaxed);
}
}
/// Returns the total tracked size.
pub(crate) fn total(&self) -> u64 {
self.total_size.load(Ordering::Relaxed)
}
/// Resets all tracking.
pub(crate) fn reset(&self) {
self.file_sizes.write().unwrap().clear();
self.total_size.store(0, Ordering::Relaxed);
}
fn record(&self, key: FileKey, size: u64) {
// Remove the old size if present
if let Some(old_size) = self.file_sizes.write().unwrap().insert(key, size) {
self.total_size.fetch_sub(old_size, Ordering::Relaxed);
}
self.total_size.fetch_add(size, Ordering::Relaxed);
}
}
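A minimal crate-internal sketch, not part of this diff, showing how one shared SizeTracker backs both per-kind trackers and a single aggregated total; the versions and byte sizes are made up.
fn size_tracker_sketch() {
    let total = Arc::new(AtomicU64::new(0));
    let sizes = SizeTracker::new(total);
    let deltas = sizes.manifest_tracker();
    let checkpoints = sizes.checkpoint_tracker();
    deltas.record(1, 128); // tracked as FileKey::Delta(1)
    checkpoints.record(1, 512); // tracked as FileKey::Checkpoint(1)
    assert_eq!(sizes.total(), 640);
    // Re-recording a version replaces its previous size instead of double counting.
    deltas.record(1, 256);
    assert_eq!(sizes.total(), 768);
    sizes.remove(&FileKey::Checkpoint(1));
    assert_eq!(sizes.total(), 256);
    sizes.reset();
    assert_eq!(sizes.total(), 0);
}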

Some files were not shown because too many files have changed in this diff.