* chore/expose-symbols:
### Commit Message
Enhance `merge_and_dedup` Functionality in `flush.rs`
- **Function Signature Update**: Modified the `merge_and_dedup` function to accept `append_mode` and `merge_mode` as separate parameters instead of using `options`.
- **Function Accessibility**: Changed the visibility of `merge_and_dedup` to `pub` to allow external access.
- **Function Calls Update**: Updated calls to `merge_and_dedup` within `memtable_flat_sources` to align with the new function signature, passing `options.append_mode` and `options.merge_mode()` directly.
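A minimal sketch of the new call shape, with stand-in types for the mito2 iterator and mode definitions (not the actual ones):

```rust
// Stand-in types; the real definitions live in mito2.
pub struct RecordBatchStub;

#[derive(Clone, Copy)]
pub enum MergeMode {
    LastRow,
    LastNonNull,
}

pub type BoxedBatchIterator = Box<dyn Iterator<Item = RecordBatchStub>>;

// Now `pub`, and taking the two modes directly instead of an options struct,
// so external callers don't need to construct `RegionOptions`.
pub fn merge_and_dedup(
    sources: Vec<BoxedBatchIterator>,
    append_mode: bool,
    merge_mode: MergeMode,
) -> BoxedBatchIterator {
    let merged = sources.into_iter().flatten();
    if append_mode {
        // Append mode keeps duplicate rows as-is.
        Box::new(merged)
    } else {
        // Otherwise apply deduplication according to `merge_mode` (elided here).
        let _ = merge_mode;
        Box::new(merged)
    }
}

// Illustrative call site inside `memtable_flat_sources`:
// let iter = merge_and_dedup(sources, options.append_mode, options.merge_mode());
```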
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* chore/expose-symbols:
### Add Merge and Deduplication Functionality
- **File**: `src/mito2/src/flush.rs`
- Introduced `merge_and_dedup` function to merge multiple record batch iterators and apply deduplication based on specified modes.
- Added detailed documentation for the function, explaining its arguments, behavior, and usage examples.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* chore: update proto to include native histogram
* ci: add a CI check to ensure whitelisted dependencies are using their main branch
* chore: add changes to Cargo.toml to trigger CI
* chore: update proto
* test: update test to include histogram
* refactor(cli): unify storage configuration for export command
- Utilize ObjectStoreConfig to unify storage configuration for export command
- Support export command for Fs, S3, OSS, GCS and Azblob
- Fix the Display implementation for SecretString, which always returned the string
"SecretString([REDACTED])" even when the internal secret was empty.
Signed-off-by: McKnight22 <tao.wang.22@outlook.com>
* refactor(cli): unify storage configuration for export command
- Change the visibility of the configuration options for every storage
backend to public.
Signed-off-by: McKnight22 <tao.wang.22@outlook.com>
Co-authored-by: WenyXu <wenymedia@gmail.com>
* refactor(cli): unify storage configuration for export command
- Update the ObjectStoreConfig::build_xxx() implementations to use macros
Signed-off-by: McKnight22 <tao.wang.22@outlook.com>
Co-authored-by: WenyXu <wenymedia@gmail.com>
* refactor(cli): unify storage configuration for export command
- Introduce config validation for each storage type
Signed-off-by: McKnight22 <tao.wang.22@outlook.com>
* refactor(cli): unify storage configuration for export command
- Enable trait-based polymorphism for storage type handling
(from inherent impl to trait impl)
- Extract helper functions to reduce code duplication
Signed-off-by: McKnight22 <tao.wang.22@outlook.com>
* refactor(cli): unify storage configuration for export command
- Improve SecretString handling and validation
(Distinguishing between "not provided" and "empty string")
- Add validation when using filesystem storage
Signed-off-by: McKnight22 <tao.wang.22@outlook.com>
* refactor(cli): unify storage configuration for export command
- Refactor storage field validation with macro
Signed-off-by: McKnight22 <tao.wang.22@outlook.com>
* refactor(cli): unify storage configuration for export command
- Support GCS Application Default Credentials in export (e.g. GKE, Cloud Run, or local development)
(Enabling ADC without requiring the credential options to be present)
(Making the endpoint optional in GCS validation (defaults to https://storage.googleapis.com))
Signed-off-by: McKnight22 <tao.wang.22@outlook.com>
* refactor(cli): unify storage configuration for export command
This commit refactors the validation logic for object store configurations in the CLI to leverage clap features and reduce boilerplate.
Key changes:
- Update wrap_with_clap_prefix macro to use clap's requires attribute.
This ensures that storage-specific options (e.g., --s3-bucket) are only accepted when the corresponding backend is enabled (e.g., --s3).
- Simplify FieldValidator trait by removing the is_provided method, as dependency checks are now handled by clap.
- Introduce validate_backend! macro to standardize the validation of required fields for enabled backends.
- Refactor ExportCommand to remove explicit validation calls (validate_s3, etc.) and rely on the validation within backend constructors.
- Add integration tests for ExportCommand to verify build success with S3, OSS, GCS, and Azblob configurations.
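For illustration, clap's `requires` attribute ties a storage-specific option to its backend switch roughly like this (the field names are examples, not the exact `ExportCommand` definition):

```rust
use clap::Parser;

#[derive(Parser, Debug)]
struct ExportArgs {
    /// Enable the S3 backend.
    #[arg(long)]
    s3: bool,

    /// Only accepted when --s3 is present; clap rejects it otherwise.
    #[arg(long, requires = "s3")]
    s3_bucket: Option<String>,
}

fn main() {
    // `--s3-bucket my-bucket` without `--s3` now fails at parse time,
    // so no hand-written dependency check is needed.
    let args = ExportArgs::parse();
    println!("{args:?}");
}
```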
Signed-off-by: McKnight22 <tao.wang.22@outlook.com>
* refactor(cli): unify storage configuration for export command
- Use macros to simplify storage export implementation
Signed-off-by: McKnight22 <tao.wang.22@outlook.com>
Co-authored-by: WenyXu <wenymedia@gmail.com>
* refactor(cli): unify storage configuration for export command
- Rollback StorageExport trait implementation to not using macro for better code clarity and maintainability
- Introduce format_uri helper function to unify URI formatting logic
- Fix OSS URI path bug inherited from legacy code
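A rough sketch of the trait shape described above; apart from `format_uri`, the method set and fields are assumptions:

```rust
// Illustrative only: the real StorageExport trait lives in the CLI crate.
trait StorageExport {
    /// Render the destination URI for an exported file.
    fn format_uri(&self, path: &str) -> String;
}

struct S3Storage {
    bucket: String,
    root: String,
}

struct FsStorage {
    dir: String,
}

impl StorageExport for S3Storage {
    fn format_uri(&self, path: &str) -> String {
        format!("s3://{}/{}/{}", self.bucket, self.root.trim_matches('/'), path)
    }
}

impl StorageExport for FsStorage {
    fn format_uri(&self, path: &str) -> String {
        format!("{}/{}", self.dir.trim_end_matches('/'), path)
    }
}
```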
Signed-off-by: McKnight22 <tao.wang.22@outlook.com>
Co-authored-by: WenyXu <wenymedia@gmail.com>
* refactor(cli): unify storage configuration for export command
- Remove unnecessary async_trait
Signed-off-by: McKnight22 <tao.wang.22@outlook.com>
Co-authored-by: jeremyhi <jiachun_feng@proton.me>
---------
Signed-off-by: McKnight22 <tao.wang.22@outlook.com>
Co-authored-by: WenyXu <wenymedia@gmail.com>
Co-authored-by: jeremyhi <jiachun_feng@proton.me>
* refactor/expose-symbols:
## Refactor `bulk/part.rs` to Simplify Mutation Handling
- Removed the `mutations_to_record_batch` function and its associated helper functions, including `ArraysSorter`, `timestamp_array_to_iter`, and `binary_array_to_dictionary`, to simplify the mutation handling logic in `bulk/part.rs`.
- Deleted related test functions `check_binary_array_to_dictionary` and `check_mutations_to_record_batches` from the test module, along with their associated test cases.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* refactor/expose-symbols:
### Commit Message
**Refactor and Enhance Deduplication Logic**
- **`flush.rs`**: Refactored `maybe_dedup_one` function to accept `append_mode` and `merge_mode` as parameters instead of `RegionOptions`. This change enhances flexibility in deduplication logic.
- **`memtable/bulk.rs`**: Made `BulkRangeIterBuilder` struct and its fields public to allow external access and modification, improving extensibility.
- **`sst.rs`**: Corrected a typo in the schema documentation, changing `__prmary_key` to `__primary_key` for clarity and accuracy.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat: adds the foundational types and SQL parsing support for vector index
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* refactor: by suggestions
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix: ensure index option values must be greater than zero
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* chore: validate connectivity strictly
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix: compile error
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* feat: disable SIMD for ci
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
---------
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix/flight-stuck-on-first-message:
**Refactor GRPC Stream Handling and Table Resolution**
- **`grpc.rs`**: Refactored the `GrpcQueryHandler` to resolve table references and check permissions only once per stream, improving efficiency. Introduced a mechanism to handle table resolution and permission checks after receiving the first `RecordBatch`.
- **`flight.rs`**: Enhanced `PutRecordBatchRequestStream` to manage stream states (`Init` and `Ready`) for better handling of schema and table name extraction. Improved error handling and logging for unexpected flight messages.
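A simplified sketch of the two-state stream; the types are stand-ins for the flight module's own:

```rust
// Stand-in for the schema bytes carried by the first FlightData message.
struct SchemaBytes(Vec<u8>);

enum PutRecordBatchRequestStreamState {
    // Waiting for the first flight message, which carries the schema and
    // the target table name.
    Init,
    // Schema and table name extracted; later messages are data batches.
    Ready { table: String, schema: SchemaBytes },
}

impl PutRecordBatchRequestStreamState {
    fn on_first_message(&mut self, table: String, schema: SchemaBytes) {
        // Transition Init -> Ready exactly once per stream.
        *self = PutRecordBatchRequestStreamState::Ready { table, schema };
    }
}
```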
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* chore: add some doc
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat: support function aliases and add MySQL-compatible aliases
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix: get_table_function_source
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* refactor: add function_alias mod
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix: license
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
---------
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* feat: collect per file metrics
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: divide build_cost to build_part_cost and build_reader_cost
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: limit the file metrics num to display
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: use sorted iter to get sorted files
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: output metrics in desc order
Signed-off-by: evenyag <realevenyag@gmail.com>
---------
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat/allow-one-to-many-pipeline:
### Enhance Pipeline Processing for One-to-Many Transformations
- **Support One-to-Many Transformations**:
- Updated `processor.rs`, `etl.rs`, `vrl_processor.rs`, and `greptime.rs` to handle one-to-many transformations by allowing VRL processors to return arrays, expanding each element into separate rows.
- Introduced `transform_array_elements` and `values_to_rows` functions to facilitate this transformation.
- **Error Handling Enhancements**:
- Added new error types in `error.rs` to handle cases where array elements are not objects and for transformation failures.
- **Testing Enhancements**:
- Added tests in `pipeline.rs` to verify one-to-many transformations, single object processing, and error handling for non-object array elements.
- **Context Management**:
- Modified `ctx_req.rs` to clone `ContextOpt` when adding rows, ensuring correct context management during transformations.
- **Server Pipeline Adjustments**:
- Updated `pipeline.rs` in `servers` to handle transformed outputs with one-to-many row expansions, ensuring correct row padding and request formation.
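As a rough illustration of the one-to-many expansion rule (using serde_json values as a stand-in for VRL values; the real `values_to_rows` works on the pipeline's own types):

```rust
use serde_json::Value;

#[derive(Debug)]
struct Row(Value);

/// One object becomes one row; an array becomes one row per element.
/// Non-object array elements are rejected, mirroring the new error types.
fn values_to_rows(value: Value) -> Result<Vec<Row>, String> {
    match value {
        Value::Object(_) => Ok(vec![Row(value)]),
        Value::Array(items) => items
            .into_iter()
            .map(|item| {
                if item.is_object() {
                    Ok(Row(item))
                } else {
                    Err("array element must be an object".to_string())
                }
            })
            .collect(),
        other => Err(format!("unsupported pipeline output: {other}")),
    }
}
```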
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat/allow-one-to-many-pipeline:
Add one-to-many VRL pipeline test in `http.rs`
- Introduced `test_pipeline_one_to_many_vrl` to verify VRL processor's ability to expand a single input row into multiple output rows.
- Updated `http_tests!` macro to include the new test.
- Implemented test scenarios for single and multiple input rows, ensuring correct data transformation and row count validation.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat/allow-one-to-many-pipeline:
### Add Tests for VRL Pipeline Transformations
- **File:** `src/pipeline/src/etl.rs`
- Added tests for one-to-many VRL pipeline expansion to ensure multiple output rows from a single input.
- Introduced tests to verify backward compatibility for single object output.
- Implemented tests to confirm zero rows are produced from empty arrays.
- Added validation tests to ensure array elements must be objects.
- Developed tests for one-to-many transformations with table suffix hints from VRL.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat/allow-one-to-many-pipeline:
### Enhance Pipeline Transformation with Per-Row Table Suffixes
- **`src/pipeline/src/etl.rs`**: Updated `TransformedOutput` to include per-row table suffixes, allowing for more flexible routing of transformed data. Modified `PipelineExecOutput` and related methods to
handle the new structure.
- **`src/pipeline/src/etl/transform/transformer/greptime.rs`**: Enhanced `values_to_rows` to support per-row table suffix extraction and application.
- **`src/pipeline/tests/common.rs`** and **`src/pipeline/tests/pipeline.rs`**: Adjusted tests to validate the new per-row table suffix functionality, ensuring backward compatibility and correct behavior in
one-to-many transformations.
- **`src/servers/src/pipeline.rs`**: Modified `run_custom_pipeline` to process transformed outputs with per-row table suffixes, grouping rows by `(opt, table_name)` for insertion.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat/allow-one-to-many-pipeline:
### Update VRL Processor Type Checks
- **File:** `vrl_processor.rs`
- **Changes:** Updated type checking logic to use `contains_object()` and `contains_array()` methods instead of `is_object()` and `is_array()`. This change ensures
compatibility with VRL type inference that may return multiple possible types.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat/allow-one-to-many-pipeline:
- **Enhance Error Handling**: Added new error types `ArrayElementMustBeObjectSnafu` and `TransformArrayElementSnafu` to improve error handling in `etl.rs` and `greptime.rs`.
- **Refactor Error Usage**: Moved error usage declarations in `transform_array_elements` and `values_to_rows` functions to the top of the file for better organization in `etl.rs` and `greptime.rs`.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat/allow-one-to-many-pipeline:
### Update `greptime.rs` to Enhance Error Handling
- **Error Handling**: Modified the `values_to_rows` function to handle invalid array elements based on the `skip_error` parameter. If `skip_error` is true, invalid elements are skipped; otherwise, an error is returned.
- **Testing**: Added unit tests in `greptime.rs` to verify the behavior of `values_to_rows` with different `skip_error` settings, ensuring correct processing of valid objects and appropriate error handling for invalid elements.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat/allow-one-to-many-pipeline:
### Commit Summary
- **Enhance `TransformedOutput` Structure**: Refactored `TransformedOutput` to use a `HashMap` for grouping rows by `ContextOpt`, allowing for per-row configuration options. Updated methods in `PipelineExecOutput` to support the new structure (`src/pipeline/src/etl.rs`).
- **Add New Transformation Method**: Introduced `transform_array_elements_to_hashmap` to handle array inputs with per-row `ContextOpt` in `HashMap` format (`src/pipeline/src/etl.rs`).
- **Update Pipeline Execution**: Modified `run_custom_pipeline` to process `TransformedOutput` using the new `HashMap` structure, ensuring rows are grouped by `ContextOpt` and table name (`src/servers/src/pipeline.rs`).
- **Add Tests for New Structure**: Implemented tests to verify the functionality of the new `HashMap` structure in `TransformedOutput`, including scenarios for one-to-many mapping, single object input, and empty arrays (`src/pipeline/src/etl.rs`).
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat/allow-one-to-many-pipeline:
### Refactor `values_to_rows` to Return `HashMap` Grouped by `ContextOpt`
- **`etl.rs`**:
- Updated `values_to_rows` to return a `HashMap` grouped by `ContextOpt` instead of a vector.
- Adjusted logic to handle single object and array inputs, ensuring rows are grouped by their `ContextOpt`.
- Modified functions to extract rows from default `ContextOpt` and apply table suffixes accordingly.
- **`greptime.rs`**:
- Enhanced `values_to_rows` to handle errors gracefully with `skip_error` logic.
- Added logic to group rows by `ContextOpt` for array inputs.
- **Tests**:
- Updated existing tests to validate the new `HashMap` return structure.
- Added a new test to verify correct grouping of rows by per-element `ContextOpt`.
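A minimal sketch of the grouping shape; `ContextOpt` here is a placeholder key type, not the real struct:

```rust
use std::collections::HashMap;

// Placeholder: only needs to be hashable so rows can be grouped by it.
#[derive(Clone, PartialEq, Eq, Hash, Default)]
struct ContextOpt {
    ttl: Option<String>,
}

#[derive(Debug)]
struct Row(u64);

/// Group rows by their per-element ContextOpt so each group can be inserted
/// with its own options (TTL, table suffix, ...).
fn group_rows(rows: Vec<(ContextOpt, Row)>) -> HashMap<ContextOpt, Vec<Row>> {
    let mut grouped: HashMap<ContextOpt, Vec<Row>> = HashMap::new();
    for (opt, row) in rows {
        grouped.entry(opt).or_default().push(row);
    }
    grouped
}
```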
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat/allow-one-to-many-pipeline:
### Refactor and Enhance Error Handling in ETL Pipeline
- **Refactored Functionality**:
- Replaced `transform_array_elements` with `transform_array_elements_by_ctx` in `etl.rs` to streamline transformation logic and improve error handling.
- Updated `values_to_rows` in `greptime.rs` to use `or_default` for cleaner code.
- **Enhanced Error Handling**:
- Introduced `unwrap_or_continue_if_err` macro in `etl.rs` to allow skipping errors based on pipeline context, improving robustness in data processing.
These changes enhance the maintainability and error resilience of the ETL pipeline.
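A hedged sketch of what such a macro can look like; the real one in `etl.rs` may differ in shape:

```rust
// On error, either skip the element (continue the loop) or propagate the
// error, depending on the pipeline's skip_error flag.
macro_rules! unwrap_or_continue_if_err {
    ($result:expr, $skip_error:expr) => {
        match $result {
            Ok(value) => value,
            Err(_) if $skip_error => continue,
            Err(err) => return Err(err),
        }
    };
}

fn process(items: Vec<Result<u32, String>>, skip_error: bool) -> Result<Vec<u32>, String> {
    let mut out = Vec::new();
    for item in items {
        let value = unwrap_or_continue_if_err!(item, skip_error);
        out.push(value);
    }
    Ok(out)
}
```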
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat/allow-one-to-many-pipeline:
### Update `Row` Handling in ETL Pipeline
- **Refactor `Row` Type**: Introduced `RowWithTableSuffix` type alias to simplify handling of rows with optional table suffixes across the ETL pipeline.
- **Modify Function Signatures**: Updated function signatures in `etl.rs` and `greptime.rs` to use `RowWithTableSuffix` for better clarity and consistency.
- **Enhance Test Coverage**: Adjusted test logic in `greptime.rs` to align with the new `RowWithTableSuffix` type, ensuring correct grouping and processing of rows by TTL.
Files affected: `etl.rs`, `greptime.rs`.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* fix: regression with shortcutted statement on postgres extended query
* chore: typo fix
* feat: also add more type support for parameters
* chore: remove dbg print
* refactor/bulk-insert-service:
refactor: decode FlightData early in put_record_batch pipeline
- Move FlightDecoder usage from Inserter up to PutRecordBatchRequestStream,
passing decoded RecordBatch and schema bytes instead of raw FlightData.
- Eliminate redundant per-request decoding/encoding in Inserter; encode
once and reuse for all region requests.
- Streamline GrpcQueryHandler trait and implementations to accept
PutRecordBatchRequest containing pre-decoded data.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* refactor/bulk-insert-service:
feat: stream-based bulk insert with per-batch responses
- Introduce handle_put_record_batch_stream() to process Flight DoPut streams
- Resolve table & permissions once, yield (request_id, AffectedRows) per batch
- Replace loop-over-request with async-stream in frontend & server
- Make PutRecordBatchRequestStream public for cross-crate usage
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* refactor/bulk-insert-service:
fix: propagate request_id with errors in bulk insert stream
Changes the bulk-insert stream item type from
Result<(i64, AffectedRows), E> to (i64, Result<AffectedRows, E>)
so every emitted tuple carries the request_id even on failure,
letting callers correlate errors with the originating request.
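Conceptually the item type changed as below; `AffectedRows` and the error type are stand-ins:

```rust
type AffectedRows = usize;
type Error = String;

// Before: a failed batch lost its request_id entirely.
#[allow(dead_code)]
type OldItem = Result<(i64, AffectedRows), Error>;

// After: the request_id travels outside the Result, so callers can still
// correlate an error with the originating request.
type NewItem = (i64, Result<AffectedRows, Error>);

fn report(item: NewItem) {
    let (request_id, result) = item;
    match result {
        Ok(rows) => println!("request {request_id}: {rows} rows"),
        Err(err) => println!("request {request_id} failed: {err}"),
    }
}
```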
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* refactor/bulk-insert-service:
refactor: unify DoPut response stream to return DoPutResponse
Replace the tuple (i64, Result<AffectedRows>) with Result<DoPutResponse>
throughout the gRPC bulk-insert path so the handler, adapter and server
all speak the same type.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* refactor/bulk-insert-service:
feat: add elapsed_secs to DoPutResponse for bulk-insert timing
- DoPutResponse now carries elapsed_secs field
- Frontend measures and attaches insert duration
- Server observes GRPC_BULK_INSERT_ELAPSED metric from response
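A small sketch of the timing flow; fields beyond those named above are assumptions:

```rust
use std::time::Instant;

// Assumed shape: the response carries affected rows plus the insert duration.
struct DoPutResponse {
    request_id: i64,
    affected_rows: usize,
    elapsed_secs: f64,
}

fn handle_batch(request_id: i64) -> DoPutResponse {
    let start = Instant::now();
    let affected_rows = 42; // placeholder for the actual region inserts
    DoPutResponse {
        request_id,
        affected_rows,
        // The frontend attaches the duration; the server reads it back to
        // observe the GRPC_BULK_INSERT_ELAPSED metric.
        elapsed_secs: start.elapsed().as_secs_f64(),
    }
}
```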
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* refactor/bulk-insert-service:
refactor: unify Bytes import in flight module
- Replace `bytes::Bytes` with `Bytes` alias for consistency
- Remove redundant `ProstBytes` alias
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* refactor/bulk-insert-service:
fix: terminate gRPC stream on error and optimize FlightData handling
- Stop retrying on stream errors in gRPC handler
- Replace Vec1 indexing with into_iter().next() for FlightData
- Remove redundant clones in bulk_insert and flight modules
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* refactor/bulk-insert-service:
Improve permission check placement in `grpc.rs`
- Moved the permission check for `BulkInsert` to occur before resolving the table reference in `GrpcQueryHandler` implementation.
- Ensures permission validation is performed earlier in the process, potentially avoiding unnecessary operations if permission is denied.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* refactor/bulk-insert-service:
**Refactor Bulk Insert Handling in gRPC**
- **`grpc.rs`**:
- Switched from `async_stream::stream` to `async_stream::try_stream` for error handling.
- Removed `body_size` parameter and added `flight_data` to `handle_bulk_insert`.
- Simplified error handling and permission checks in `GrpcQueryHandler`.
- **`bulk_insert.rs`**:
- Added `raw_flight_data` parameter to `handle_bulk_insert`.
- Calculated `body_size` from `raw_flight_data` and removed redundant encoding logic.
- **`flight.rs`**:
- Replaced `body_size` with `flight_data` in `PutRecordBatchRequest`.
- Updated memory usage calculation to include `flight_data` components.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* refactor/bulk-insert-service:
perf(bulk_insert): encode record batch once per datanode
Move FlightData encoding outside the per-region loop so the same
encoded bytes are reused when mask.select_all(), eliminating redundant
serialisation work.
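The change amounts to hoisting the encode step out of the per-region loop, roughly as below; all names are illustrative:

```rust
#[derive(Clone)]
struct EncodedFlightData(Vec<u8>);

struct RegionMask;

impl RegionMask {
    // Stand-in: whether this region takes every row of the batch.
    fn select_all(&self) -> bool {
        true
    }
}

fn build_region_requests(
    regions: &[(u64, RegionMask)],
    batch_bytes: &[u8],
) -> Vec<(u64, EncodedFlightData)> {
    // Encode the record batch once; reuse the bytes whenever a region takes
    // the whole batch instead of re-encoding inside the loop.
    let full = EncodedFlightData(batch_bytes.to_vec());
    regions
        .iter()
        .map(|(region_id, mask)| {
            if mask.select_all() {
                (*region_id, full.clone())
            } else {
                // Only row subsets still need their own encoding pass (elided).
                (*region_id, EncodedFlightData(batch_bytes.to_vec()))
            }
        })
        .collect()
}
```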
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* refactor/allow-custom-flight-service:
### Add Custom Flight Handler Support
- **`server.rs`**:
- Introduced a new field `flight_handler` in the `Services` struct to allow optional custom flight handler configuration.
- Added a method `with_flight_handler` to set the custom flight handler.
- Modified `build_grpc_server` to use the custom flight handler if provided, defaulting to `GreptimeRequestHandler` otherwise.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* refactor/allow-custom-flight-service:
### Make structs and enums public in `flight.rs`
- Changed visibility of `PutRecordBatchRequest` and `PutRecordBatchRequestStream` structs to public.
- Made `PutRecordBatchRequestStreamState` enum public.
- Updated fields within `PutRecordBatchRequest` and `PutRecordBatchRequestStream` to be public.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat/change-tsid-gen:
perf(metric-engine): replace mur3 with fxhash for faster TSID generation
- Switches from mur3::Hasher128 to fxhash::FxHasher for TSID hashing
- Pre-computes label-name hash when no nulls are present, avoiding redundant work
- Adds fast-path for rows without nulls; falls back to slow path otherwise
- Updates Cargo.toml and lockfile to reflect dependency change
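For reference, the fxhash-based hashing path looks roughly like this; the label layout and the exact bytes fed into the hasher are assumptions:

```rust
use std::hash::{Hash, Hasher};

use fxhash::FxHasher;

/// Hash label names and values into a 64-bit TSID-like value. The fast path
/// can pre-compute the hash over label names once and reuse it for every row
/// that has no NULL labels.
fn tsid_like_hash(labels: &[(&str, &str)]) -> u64 {
    let mut hasher = FxHasher::default();
    for (name, value) in labels {
        name.hash(&mut hasher);
        value.hash(&mut hasher);
    }
    hasher.finish()
}

fn main() {
    let id = tsid_like_hash(&[("host", "h1"), ("region", "us-west")]);
    println!("{id:x}");
}
```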
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat/change-tsid-gen:
fix: only check primary-key labels for null when re-using cached hash
- Rename has_null() → has_null_labels() and restrict the check to the
primary-key columns so that non-label NULLs do not force a full
TSID re-computation.
- Update expected hashes in tests to match the new logic.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat/change-tsid-gen:
test: add comprehensive TSID generation tests for label ordering and null handling
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat/change-tsid-gen:
bench: add criterion benchmark for TSID generator
- Compare original mur3 vs current fxhash fast/slow paths
- Test 2, 5, 10 label sets plus null-value slow path
- Add mur3 & criterion dev-deps; register bench target
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat/change-tsid-gen:
test: stabilize metric-engine tests by fixing non-deterministic row order
- Add ORDER BY to SELECTs in TTL tests to ensure consistent output
- Update expected __tsid values after hash function change
- Swap expected OTLP metric rows to match new ordering
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat/change-tsid-gen:
refactor: simplify Default impls and remove redundant code
- Replace manual Default for TsidGenerator with derive
- Remove unnecessary into_iter() call
- Simplify Option::unwrap_or_else to unwrap_or
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat(mysql): add SHOW WARNINGS support and return warnings for unsupported SET variables
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* feat(function): add MySQL IF() function and PostgreSQL description functions for connector compatibility
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix: show tables for mysql
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix: partitions table in information_schema and add starrocks external catalog compatibility
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* refactor: async udf
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix: set warnings
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* feat: impl pg_my_temp_schema and make description functions simple
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* test: add test for issue 7313
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* feat: apply suggestions
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix: partition_expression and partition_description
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix: test
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix: unit tests
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix: search_path only works for pg
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* feat: improve warnings processing
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix: warnings while writing affected rows and refactor
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* chore: improve ShobjDescriptionFunction signature
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* refactor: array_to_boolean
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
---------
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* chore: return meaningful message when content type mismatch in otel
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* refactor: extract duplicated code
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* chore: use a new error for failing to decode loki request
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
---------
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* feat: gc ctx&procedure
Signed-off-by: discord9 <discord9@163.com>
* fix: handle region not found case
Signed-off-by: discord9 <discord9@163.com>
* docs: more explain&todo
Signed-off-by: discord9 <discord9@163.com>
* per review
Signed-off-by: discord9 <discord9@163.com>
* chore: add time for region gc
Signed-off-by: discord9 <discord9@163.com>
* fix: explain why loader for gc region should fail
Signed-off-by: discord9 <discord9@163.com>
---------
Signed-off-by: discord9 <discord9@163.com>
* feat: split batches by rule in build_flat_sources()
It checks the num_series and splits batches when the series cardinality
is low
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: panic when no num_series available
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: don't subtract file index if checking mem range
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: update comments and control flow
Signed-off-by: evenyag <realevenyag@gmail.com>
* style: fix clippy
Signed-off-by: evenyag <realevenyag@gmail.com>
---------
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: divide parquet and puffin index
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: download index files when we open the region
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: use different label for parquet/puffin
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: control parallelism and cache size by env
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: change gauge to counter
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: correct file type labels in file cache
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: move env to config and change cache ratio to percent
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: checks capacity before download and refine metrics
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: change open to return MitoRegionRef
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: extract download to FileCache
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: run load cache task in write cache
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: check region state before downloading files
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: update config docs and test
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: use file id from index_file_id to compute puffin key
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: skip loading cache in some states
Signed-off-by: evenyag <realevenyag@gmail.com>
---------
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix/region-expire-state:
Refactor region state handling in compaction task and manifest updates
- Introduce a variable to hold the current region state for clarity in compaction task updates.
- Add an expected_region_state field to RegionEditResult to manage region state expectations during manifest handling.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* fix/region-expire-state:
Refactor region state handling in compaction task
- Replace direct assignment of `RegionLeaderState::Writable` with dynamic state retrieval and conditional check for leader state.
- Modify `RegionEditResult` to include a flag `update_region_state` instead of `expected_region_state` to indicate if the region state should be updated to writable.
- Adjust handling of `RegionEditResult` in `handle_manifest` to conditionally update region state based on the new flag.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat/allow-fuzz-input-override:
Add environment override for fuzzing parameters and seed values
- Implement `get_fuzz_override` function to read override values from environment variables for fuzzing parameters.
- Allow overriding `SEED`, `ACTIONS`, `ROWS`, `TABLES`, `COLUMNS`, `INSERTS`, and `PARTITIONS` in various fuzzing targets.
- Introduce new constants `GT_FUZZ_INPUT_MAX_PARTITIONS` and `FUZZ_OVERRIDE_PREFIX`.
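A hedged sketch of the override lookup; the prefix value and variable names are illustrative, not the actual constants:

```rust
use std::env;
use std::str::FromStr;

// Assumed prefix; the real FUZZ_OVERRIDE_PREFIX may differ.
const FUZZ_OVERRIDE_PREFIX: &str = "GT_FUZZ_OVERRIDE_";

/// Read an override such as GT_FUZZ_OVERRIDE_ROWS=100 from the environment,
/// returning None when the variable is unset or fails to parse.
fn get_fuzz_override<T: FromStr>(name: &str) -> Option<T> {
    env::var(format!("{FUZZ_OVERRIDE_PREFIX}{name}"))
        .ok()
        .and_then(|raw| raw.parse().ok())
}

fn main() {
    // Fall back to the generated fuzz input when no override is set.
    let rows: usize = get_fuzz_override("ROWS").unwrap_or(1024);
    println!("rows = {rows}");
}
```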
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat/allow-fuzz-input-override: Remove GT_FUZZ_INPUT_MAX_PARTITIONS constant and usage from fuzzing utils and tests
• Deleted the GT_FUZZ_INPUT_MAX_PARTITIONS constant from fuzzing utility functions.
• Updated FuzzInput struct in fuzz_migrate_mito_regions.rs to use a hardcoded range instead of get_gt_fuzz_input_max_partitions for determining the number of partitions.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat/allow-fuzz-input-override:
Improve fuzzing documentation with environment variable overrides
Enhanced the fuzzing instructions in the README to include guidance on how to override fuzz input using environment variables, providing an example for better clarity.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat(expr): support vec_elem_avg function
Signed-off-by: Alan Tang <jmtangcs@gmail.com>
* feat: support vec_avg function
Signed-off-by: Alan Tang <jmtangcs@gmail.com>
* test: add more query test for avg aggregator
Signed-off-by: Alan Tang <jmtangcs@gmail.com>
* fix: fix the merge batch mode
Signed-off-by: Alan Tang <jmtangcs@gmail.com>
* refactor: use sum and count as state for avg function
Signed-off-by: Alan Tang <jmtangcs@gmail.com>
* refactor: refactor merge batch mode for avg function
Signed-off-by: Alan Tang <jmtangcs@gmail.com>
* feat: add additional vector restrictions for validation
Signed-off-by: Alan Tang <jmtangcs@gmail.com>
---------
Signed-off-by: Alan Tang <jmtangcs@gmail.com>
Co-authored-by: Yingwen <realevenyag@gmail.com>
* fix/pick-continue:
### Add Tests for TWCS Compaction Logic
- **`twcs.rs`**:
- Modified the logic in `TwcsPicker` to handle cases with zero runs by using `continue` instead of `return`.
- Added two new test cases: `test_build_output_multiple_windows_with_zero_runs` and `test_build_output_single_window_zero_runs` to verify the behavior of the compaction logic when there are zero runs in
the windows.
- **`memtable_util.rs`**:
- Removed unused import `PredicateGroup`.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* fix: clippy
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* fix/pick-continue:
### Commit Message
Enhance Compaction Process with Expired SST Handling and Testing
- **`compactor.rs`**:
- Introduced handling for expired SSTs by updating the manifest immediately upon task completion.
- Added new test cases to verify the handling of expired SSTs and manifest updates.
- **`task.rs`**:
- Implemented `remove_expired` function to handle expired SSTs by updating the manifest and notifying the region worker loop.
- Refactored `handle_compaction` to `handle_expiration_and_compaction` to integrate expired SST removal before merging inputs.
- Added logging and error handling for expired SST removal process.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* refactor/progressive-compaction:
**Enhance Compaction Task Error Handling**
- Updated `task.rs` to conditionally execute the removal of expired SST files only when they exist, improving error handling and performance.
- Added a check for non-empty `expired_ssts` before initiating the removal process, ensuring unnecessary operations are avoided.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* refactor/progressive-compaction:
### Refactor `DefaultCompactor` to Extract `merge_single_output` Method
- **File**: `src/mito2/src/compaction/compactor.rs`
- Extracted the logic for merging a single compaction output into SST files into a new method `merge_single_output` within the `DefaultCompactor` struct.
- Simplified the `merge_ssts` method by utilizing the new `merge_single_output` method, reducing code duplication and improving maintainability.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* refactor/progressive-compaction:
### Add Max Background Compaction Tasks Configuration
- **`compaction.rs`**: Added `max_background_compactions` to the compaction scheduler to limit background tasks.
- **`compaction/compactor.rs`**: Removed immediate manifest update logic after task completion.
- **`compaction/picker.rs`**: Introduced `max_background_tasks` parameter in `new_picker` to control task limits.
- **`compaction/twcs.rs`**: Updated `TwcsPicker` to include `max_background_tasks` and truncate inputs exceeding this limit. Added related test cases to ensure functionality.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* fix/pick-continue:
### Improve Error Handling and Task Management in Compaction
- **`task.rs`**: Enhanced error handling in `remove_expired` function by logging errors without halting the compaction process. Removed the return of `Result` type and added detailed logging for various
failure scenarios.
- **`twcs.rs`**: Adjusted task management logic by removing input truncation based on `max_background_tasks` and instead discarding remaining tasks if the output size exceeds the limit. This ensures better
control over task execution and resource management.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* fix/pick-continue:
### Add Unit Tests for Compaction Task and TWCS Picker
- **`task.rs`**: Added unit tests to verify the behavior of `PickerOutput` with and without expired SSTs.
- **`twcs.rs`**: Introduced tests for `TwcsPicker` to ensure correct handling of `max_background_tasks` during compaction, including scenarios with and without task truncation.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* fix/pick-continue:
**Improve Error Handling and Notification in Compaction Task**
- **File:** `task.rs`
- Changed log level from `warn` to `error` for manifest update failures to enhance error visibility.
- Refactored the notification mechanism for expired file removal by using `BackgroundNotify::RegionEdit` with `RegionEditResult` to streamline the process.
- Simplified error handling by consolidating match cases into a single `if let Err` block for better readability and maintainability.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* test: add tests for scanning append mode before flush
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: extract a function maybe_dedup_one
Signed-off-by: evenyag <realevenyag@gmail.com>
* ci: add flat format to docs.yml so we can make it required later
Signed-off-by: evenyag <realevenyag@gmail.com>
---------
Signed-off-by: evenyag <realevenyag@gmail.com>
* mito2: add unit test for flat single-range append_mode dedup behavior
Verify memtable_flat_sources skips dedup when append_mode is true and
performs dedup otherwise for single-range flat memtables, preventing
regressions in the new append_mode path.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* fix/flat-source-merge:
### Improve Column Metadata Extraction Logic
- **File**: `src/common/meta/src/ddl/utils.rs`
- Modified the `extract_column_metadatas` function to use `swap_remove` for extracting the first schema and decode column metadata for comparison instead of raw bytes. This ensures that the extension map is considered during
verification, enhancing the robustness of metadata consistency checks across datanodes.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat: add arrow json extension type
* feat: add json structure settings to extension type
* refactor: store json structure settings as extension metadata
* chore: make binary an acceptable type for extension
* chore/add-region-insert-failure-metric: Add metric for failed insert requests to region server in datanode module
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* chore/add-region-insert-failure-metric:
Add metric for tracking failed region server requests
- Introduce a new metric `REGION_SERVER_REQUEST_FAILURE_COUNT` to count failed region server requests.
- Update `REGION_SERVER_INSERT_FAIL_COUNT` metric description for consistency.
- Implement error handling in `RegionServerHandler` to increment the new failure metric on request errors.
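With the prometheus crate, registering such a counter typically looks like the sketch below; the registered metric name string is a guess, and the codebase's own registration style may differ:

```rust
use lazy_static::lazy_static;
use prometheus::{register_int_counter, IntCounter};

lazy_static! {
    /// Counts region server requests that failed, regardless of request kind.
    static ref REGION_SERVER_REQUEST_FAILURE_COUNT: IntCounter = register_int_counter!(
        "region_server_request_failure_count",
        "Number of failed region server requests"
    )
    .unwrap();
}

fn on_request_error() {
    // Incremented from the RegionServerHandler error path.
    REGION_SERVER_REQUEST_FAILURE_COUNT.inc();
}
```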
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* fix: potential failure in the test_index_build_type_compact test
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix: relax timestamp checking in test_timestamp_default_now
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
---------
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* feat/objbench-subcmd:
### Add Object Storage Benchmark Tool and Update Dependencies
- **`Cargo.lock` & `Cargo.toml`**: Added dependencies for `colored`, `parquet`, and `pprof` to support new features.
- **`datanode.rs`**: Introduced `ObjbenchCommand` for benchmarking object storage, including command-line options for configuration and execution. Added `StorageConfig` and `StorageConfigWrapper` for storage engine configuration.
- **`datanode.rs`**: Implemented a stub for `build_object_store` function to initialize object storage.
These changes introduce a new subcommand for object storage benchmarking and update dependencies to support additional functionality.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* init
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* fix: code style and clippy
* feat/objbench-subcmd:
Improve error handling in `objbench.rs`
- Enhanced error handling in `parse_config` and `parse_file_dir_components` functions by replacing `unwrap` with `OptionExt` and `context` for better error messages.
- Updated `build_access_layer_simple` and `build_cache_manager` functions to use `map_err` for more descriptive error handling.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* chore: rebase main
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
again
false by default
test: config api
refactor: per code review
less info!
even less info!!
docs: gc regions instr
refactor: group by region id
per code review
per review
error handling?
test: fix
todos
fix after rebase
after refactor
Signed-off-by: discord9 <discord9@163.com>
* feat/gdump:
### Add Support for Jemalloc Gdump Flag
- **`jemalloc.rs`**: Introduced `PROF_GDUMP` constant and added functions `set_gdump_active` and `is_gdump_active` to manage the gdump flag.
- **`error.rs`**: Added error handling for reading and updating the jemalloc gdump flag with `ReadGdump` and `UpdateGdump` errors.
- **`lib.rs`**: Exposed `is_gdump_active` and `set_gdump_active` functions for non-Windows platforms.
- **`http.rs`**: Added HTTP routes for checking and toggling the jemalloc gdump flag status.
- **`mem_prof.rs`**: Implemented handlers `gdump_toggle_handler` and `gdump_status_handler` for managing gdump flag via HTTP requests.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* Update docs/how-to/how-to-profile-memory.md
Co-authored-by: shuiyisong <113876041+shuiyisong@users.noreply.github.com>
* fix: typo in docs
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
Co-authored-by: shuiyisong <113876041+shuiyisong@users.noreply.github.com>
* feat: adds format, regex_extract function and more type tests
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix: forgot functions
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* chore: forgot null type
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* test: forgot date type
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* feat: remove format function
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* test: update results after upgrading datafusion
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
---------
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* perf: only decode primary keys in the batch
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: don't push none to creator
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: implement method to filter __table_id for sparse encoding
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: filter table id for sparse encoding separately
The __table_id column isn't present in the projection, so we have to filter it
manually
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: decode tags for sparse encoding when building bloom filter
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: support inverted index for tags under sparse encoding
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: skip tag columns in fulltext index
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: fix warnings
Signed-off-by: evenyag <realevenyag@gmail.com>
* style: fix clippy
Signed-off-by: evenyag <realevenyag@gmail.com>
* test: fix list index metadata test
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: decode primary key columns to filter
When primary key columns are not in the projection but appear in filters, we need
to decode them in compute_filter_mask_flat
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: reuse filter method
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: only use dictionary for string type in compat
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: safe to get column by creator's column id
Signed-off-by: evenyag <realevenyag@gmail.com>
---------
Signed-off-by: evenyag <realevenyag@gmail.com>
* test: support flat in basic_test
Signed-off-by: evenyag <realevenyag@gmail.com>
* test: support flat in alter_test
Signed-off-by: evenyag <realevenyag@gmail.com>
* test: support flat for append_mode_test
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update bump_committed_sequence_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update close_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update compaction_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update create_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update edit_region_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update merge_mode_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update parallel_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update projection_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update prune_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update row_selector_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update scan_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update drop_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update filter_deleted_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update sync_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update set_role_state_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update staging_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update truncate_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update catchup_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update flush_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update open_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update batch_open_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* test: fix all flat format tests
Signed-off-by: evenyag <realevenyag@gmail.com>
---------
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat/manual-compaction-parallelism:
### Add Parallelism Support to Compaction Requests
- **`Cargo.lock` & `Cargo.toml`**: Updated `greptime-proto` dependency to a new revision.
- **`flush_compact_table.rs`**: Enhanced `parse_compact_params` to support a new `parallelism` parameter, allowing users to
specify the level of parallelism for table compaction.
- **`handle_compaction.rs`**: Integrated `parallelism` into the compaction scheduling process, defaulting to 1 if not
specified.
- **`request.rs` & `region_request.rs`**: Modified `CompactRequest` to include `parallelism`, with logic to handle unspecified values.
- **`requests.rs`**: Updated `CompactTableRequest` structure to include an optional `parallelism` field.
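The unspecified-value handling amounts to something like this; the field type and the zero-means-unspecified convention are assumptions:

```rust
// Assumed request-side shape: 0 means the client did not set a value.
struct CompactRequest {
    parallelism: u32,
}

fn effective_parallelism(req: &CompactRequest) -> usize {
    // Default to a parallelism of 1 when unspecified.
    if req.parallelism == 0 {
        1
    } else {
        req.parallelism as usize
    }
}
```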
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat/manual-compaction-parallelism:
### Commit Message
Enhance Compaction Request Handling
- **`flush_compact_table.rs`**:
- Renamed `parse_compact_params` to `parse_compact_request`.
- Introduced `DEFAULT_COMPACTION_PARALLELISM` constant.
- Updated parsing logic to handle keyword arguments for `strict_window` and `regular` compaction types, including `parallelism` and `window`.
- Modified tests to reflect changes in parsing logic and default parallelism handling.
- **`request.rs`**:
- Updated `parallelism` handling in `RegionRequestBody::Compact` to use the new default value.
- **`requests.rs`**:
- Changed `CompactTableRequest` to use a non-optional `parallelism` field with a default value of 1.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat/manual-compaction-parallelism:
### Update `flush_compact_table.rs` Parameter Validation
- Modified parameter validation in `flush_compact_table.rs` to restrict the maximum number of parameters from 4 to 3 in the `parse_compact_request` function.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat/manual-compaction-parallelism:
Update `greptime-proto` dependency
- Updated the `greptime-proto` dependency to a new revision in both `Cargo.lock` and `Cargo.toml`.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat: struct value
Signed-off-by: Ning Sun <sunning@greptime.com>
* feat: update for proto module
* feat: wip struct type
* feat: implement more vector operations
* feat: make datatype and api
* feat: reoslve some compilation issues
* feat: resolve all compilation issues
* chore: format update
* test: resolve tests
* test: test and refactor value-to-pb
* feat: add more tests and fix for value types
* chore: remove dbg
* feat: test and fix iterator
* fix: resolve struct_type issue
* feat: pgwire 0.33 update
* refactor: use vec for struct items
* feat: conversion from json to value
* feat: add decode function
* fix: lint issue
* feat: update how we encode raw data
* feat: add conversion to fully structured StructValue
* refactor: take owned value in all encode/decode functions
* feat: add pg serialization of structvalue
* chore: toml format
* refactor: adopt new and try_new from struct value
* chore: cleanup residual issues
* docs: docs up
* fix lint issue
* Apply suggestion from @MichaelScofield
Co-authored-by: LFC <990479+MichaelScofield@users.noreply.github.com>
* Apply suggestion from @MichaelScofield
Co-authored-by: LFC <990479+MichaelScofield@users.noreply.github.com>
* Apply suggestion from @MichaelScofield
Co-authored-by: LFC <990479+MichaelScofield@users.noreply.github.com>
* Apply suggestion from @MichaelScofield
Co-authored-by: LFC <990479+MichaelScofield@users.noreply.github.com>
* chore: address review comment especially collection capacity
* refactor: remove unneeded processed keys collection
* feat: Value::Json type
* chore: add some work in progress changes
* feat: adopt new json type
* refactor: limit scope json conversion functions
* fix: self review update
* test: provide tests for value::json
* test: add tests for api/helper
* switch proto to main branch
* fix: implement is_null for ValueRef::Json
---------
Signed-off-by: Ning Sun <sunning@greptime.com>
Co-authored-by: LFC <990479+MichaelScofield@users.noreply.github.com>
* feat: add updated_on to tablemeta with a default of created_on
Signed-off-by: Alan Tang <jmtangcs@gmail.com>
* feat: support updated_on in the alter procedure
Signed-off-by: Alan Tang <jmtangcs@gmail.com>
* feat: add updated_on into information_schema.tables
Signed-off-by: Alan Tang <jmtangcs@gmail.com>
* fix: make sqlness happy
Signed-off-by: Alan Tang <jmtangcs@gmail.com>
* test: add test case for tablemeta update
Signed-off-by: Alan Tang <jmtangcs@gmail.com>
* fix: fix failing test for ALTER TABLE
Signed-off-by: Alan Tang <jmtangcs@gmail.com>
* feat: use created_on as default for updated_on when missing
Signed-off-by: Alan Tang <jmtangcs@gmail.com>
---------
Signed-off-by: Alan Tang <jmtangcs@gmail.com>
* feat: struct value
Signed-off-by: Ning Sun <sunning@greptime.com>
* feat: update for proto module
* feat: wip struct type
* feat: implement more vector operations
* feat: make datatype and api
* feat: resolve some compilation issues
* feat: resolve all compilation issues
* chore: format update
* test: resolve tests
* test: test and refactor value-to-pb
* feat: add more tests and fix for value types
* chore: remove dbg
* feat: test and fix iterator
* fix: resolve struct_type issue
* feat: pgwire 0.33 update
* refactor: use vec for struct items
* feat: conversion from json to value
* feat: add decode function
* fix: lint issue
* feat: update how we encode raw data
* feat: add conversion to fully structured StructValue
* refactor: take owned value in all encode/decode functions
* feat: add pg serialization of structvalue
* chore: toml format
* refactor: adopt new and try_new from struct value
* chore: cleanup residual issues
* docs: docs up
* fix lint issue
* Apply suggestion from @MichaelScofield
Co-authored-by: LFC <990479+MichaelScofield@users.noreply.github.com>
* Apply suggestion from @MichaelScofield
Co-authored-by: LFC <990479+MichaelScofield@users.noreply.github.com>
* Apply suggestion from @MichaelScofield
Co-authored-by: LFC <990479+MichaelScofield@users.noreply.github.com>
* Apply suggestion from @MichaelScofield
Co-authored-by: LFC <990479+MichaelScofield@users.noreply.github.com>
* chore: address review comment especially collection capacity
* refactor: remove unneeded processed keys collection
---------
Signed-off-by: Ning Sun <sunning@greptime.com>
Co-authored-by: LFC <990479+MichaelScofield@users.noreply.github.com>
* feat: struct value
Signed-off-by: Ning Sun <sunning@greptime.com>
* feat: update for proto module
* feat: wip struct type
* feat: implement more vector operations
* feat: make datatype and api
* feat: resolve some compilation issues
* feat: resolve all compilation issues
* chore: format update
* test: resolve tests
* test: test and refactor value-to-pb
* feat: add more tests and fix for value types
* chore: remove dbg
* feat: test and fix iterator
* fix: resolve struct_type issue
* refactor: use vec for struct items
* chore: update proto to main branch
* refactor: address some of review issues
* refactor: update for further review
* Add validation on new methods
* feat: update struct/list json serialization
* refactor: reimplement get in struct_vector
* refactor: struct vector functions
* refactor: fix lint issue
* refactor: address review comments
---------
Signed-off-by: Ning Sun <sunning@greptime.com>
* feat: align influxdb line timestamp with table time index
Signed-off-by: luofucong <luofc@foxmail.com>
* fix ci
Signed-off-by: luofucong <luofc@foxmail.com>
---------
Signed-off-by: luofucong <luofc@foxmail.com>
* tests: fix unit test by passing one sort column
Signed-off-by: discord9 <discord9@163.com>
* chore: per copilot
Signed-off-by: discord9 <discord9@163.com>
---------
Signed-off-by: discord9 <discord9@163.com>
* refactor: cleanup datafusion-pg-catalog dependencies
Signed-off-by: Ning Sun <sunning@greptime.com>
* chore: toml format
* feat: update upstream
---------
Signed-off-by: Ning Sun <sunning@greptime.com>
* fix: not applied
Signed-off-by: discord9 <discord9@163.com>
* chore: per review
Signed-off-by: discord9 <discord9@163.com>
* test: confirm order by not push down
Signed-off-by: discord9 <discord9@163.com>
---------
Signed-off-by: discord9 <discord9@163.com>
* feat: use correct projection index for old format
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: remove allow dead_code from format
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: check and convert old format to flat format
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: sub primary key num from projection
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: always convert the batch in FlatRowGroupReader
Signed-off-by: evenyag <realevenyag@gmail.com>
* style: fix clippy
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: Change &Option<&[]> to Option<&[]>
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: only build arrow schema once
adds a method flat_sst_arrow_schema_column_num() to get the field num
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: Handle flat format and old format separately
Adds two structs ParquetFlat and ParquetPrimaryKeyToFlat.
ParquetPrimaryKeyToFlat delegates stats and projection to the
PrimaryKeyReadFormat.
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: handle non string tag correctly
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: do not register file cache twice
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: clean temp files
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: add rows and bytes to flush success log
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: convert format in memtable
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: add compaction flag to ScanInput
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: compaction should use old format for sparse encoding
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: merge schema use old format in sparse encoding
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: read legacy format but do not convert if skip_auto_convert
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: support sparse encoding in bulk parts
Signed-off-by: evenyag <realevenyag@gmail.com>
---------
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: add datafusion-postgres dependency
* refactor: move and include pg_catalog udfs
* chore: update upstream
* feat: register table function pg_get_keywords
* feat: bridge CatalogInfo for our CatalogManager
Signed-off-by: Ning Sun <sunning@greptime.com>
* feat: convert pg_catalog table to our system table
* feat: bridge system catalog with datafusion-postgres
Signed-off-by: Ning Sun <sunning@greptime.com>
* feat: add more udfs
* feat: add compatibility rewriter to postgres handler
* fix: various fix
* fmt: fix
* fix: use functions from pg_catalog library
* fmt
* fix: sqlness runner
Signed-off-by: Ning Sun <sunning@greptime.com>
* test: adopt arrow 56.0 to 56.1 memory size change
* fix: add additional udfs
* chore: format
* refactor: return None when creating system table failed
Signed-off-by: Ning Sun <sunning@greptime.com>
* chore: provide safety comments about expect usage
---------
Signed-off-by: Ning Sun <sunning@greptime.com>
* chore/unset-tz-env-in-test:
### Commit Message
Add environment variable cleanup in timezone tests
- Updated `timezone.rs` to include removal of the `TZ` environment variable in the `test_from_tz_string` function to ensure a clean test environment.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
fix/disable-parquet-stats-truncate:
- **Update `memcomparable` Dependency**: Switched from crates.io to a Git repository for `memcomparable` in `Cargo.lock`, `mito-codec/Cargo.toml`, and removed it from `mito2/Cargo.toml`.
- **Enhance Parquet Writer Properties**: Added `set_statistics_truncate_length` and `set_column_index_truncate_length` to `WriterProperties` in `parquet.rs`, `bulk/part.rs`, `partition_tree/data.rs`, and `writer.rs`.
- **Add Test for Corrupt Scan**: Introduced a new test module `scan_corrupt.rs` in `mito2/src/engine` to verify handling of corrupt data.
- **Update Test Data**: Modified test data in `flush.rs` to reflect changes in file sizes and sequences.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* chore/update-sequence-on-region-edit:
### Commit Message
Refactor `get_last_seq_num` Method Across Engines
- **Change Return Type**: Updated the `get_last_seq_num` method to return `Result<SequenceNumber, BoxedError>` instead of `Result<Option<SequenceNumber>, BoxedError>` in the following files:
- `src/datanode/src/tests.rs`
- `src/file-engine/src/engine.rs`
- `src/metric-engine/src/engine.rs`
- `src/metric-engine/src/engine/read.rs`
- `src/mito2/src/engine.rs`
- `src/query/src/optimizer/test_util.rs`
- `src/store-api/src/region_engine.rs`
- **Enhance Region Edit Handling**: Modified `RegionWorkerLoop` in `src/mito2/src/worker/handle_manifest.rs` to update file sequences during region edits.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
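The return-type change above is easiest to read as a trait-signature diff. A minimal, hypothetical sketch (the trait shape and the `SequenceNumber`/`BoxedError` aliases below are simplified stand-ins, not the real definitions in `region_engine.rs`):

```rust
// Simplified stand-ins for the real types; for illustration only.
type SequenceNumber = u64;
type BoxedError = Box<dyn std::error::Error + Send + Sync>;

// Old shape: callers had to handle a `None` sequence separately.
trait RegionEngineBefore {
    fn get_last_seq_num(&self) -> Result<Option<SequenceNumber>, BoxedError>;
}

// New shape: the `Option` wrapper is dropped, so callers always receive a sequence number.
trait RegionEngineAfter {
    fn get_last_seq_num(&self) -> Result<SequenceNumber, BoxedError>;
}
```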
* add committed_sequence to RegionEdit
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* chore/update-sequence-on-region-edit:
### Commit Message
Refactor sequence retrieval method
- **Renamed Method**: Changed `get_last_seq_num` to `get_committed_sequence` across multiple files to better reflect its purpose of retrieving the latest committed sequence.
- Affected files: `tests.rs`, `engine.rs` in `file-engine`, `metric-engine`, `mito2`, `test_util.rs`, and `region_engine.rs`.
- **Removed Unused Struct**: Deleted `RegionSequencesRequest` struct from `region_request.rs` as it is no longer needed.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* chore/update-sequence-on-region-edit:
**Add Committed Sequence Handling in Region Engine**
- **`engine.rs`**: Introduced a new test module `bump_committed_sequence_test` to verify committed sequence handling.
- **`bump_committed_sequence_test.rs`**: Added a test to ensure the committed sequence is correctly updated and persisted across region reopenings.
- **`action.rs`**: Updated `RegionManifest` and `RegionManifestBuilder` to include `committed_sequence` for tracking.
- **`manager.rs`**: Adjusted manifest size assertion to accommodate new committed sequence data.
- **`opener.rs`**: Implemented logic to override committed sequence during region opening.
- **`version.rs`**: Added `set_committed_sequence` method to update the committed sequence in `VersionControl`.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* chore/update-sequence-on-region-edit:
**Enhance `test_bump_committed_sequence` in `bump_committed_sequence_test.rs`**
- Updated the test to include row operations using `build_rows`, `put_rows`, and `rows_schema` to verify the committed sequence behavior.
- Adjusted assertions to reflect changes in committed sequence after row operations and region edits.
- Added comments to clarify the expected behavior of committed sequence after reopening the region and replaying the WAL.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* chore/update-sequence-on-region-edit:
**Enhance Region Sequence Management**
- **`bump_committed_sequence_test.rs`**: Updated test to handle region reopening and sequence management, ensuring committed sequences are correctly set and verified after edits.
- **`opener.rs`**: Improved committed sequence handling by overriding it only if the manifest's sequence is greater than the replayed sequence. Added logging for mutation sequence replay.
- **`region_write_ctx.rs`**: Modified `push_mutation` and `push_bulk` methods to adopt sequence numbers from parameters, enhancing sequence management during write operations.
- **`handle_write.rs`**: Updated `RegionWorkerLoop` to pass sequence numbers in `push_bulk` and `push_mutation` methods, ensuring consistent sequence handling.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* chore/update-sequence-on-region-edit:
### Remove Debug Logging from `opener.rs`
- Removed debug logging for mutation sequences in `opener.rs` to clean up the output and improve performance.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* test: migrate join tests
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* chore: update test results after rebasing main branch
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix: unstable query sort results and natural_join test
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix: count(*) with joining
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix: unstable query sort results and style
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
---------
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
The allocated and resident metrics were swapped in the set calls. This commit
fixes the issue by ensuring each metric receives its corresponding value.
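A minimal sketch of the kind of fix described above; the gauge names and the `prometheus` crate usage are assumptions for illustration, not the actual metric definitions:

```rust
use prometheus::IntGauge;

// Hypothetical example: each gauge must receive its own value.
// Before the fix, the two `set` calls had their arguments swapped.
fn report_memory(allocated: i64, resident: i64, allocated_gauge: &IntGauge, resident_gauge: &IntGauge) {
    allocated_gauge.set(allocated);
    resident_gauge.set(resident);
}
```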
* feat: support flat format in SeqScan
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: support flat format in unordered scan
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: support parallel read for flat format in SeqScan
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: rename flat DedupReader to FlatDedupReader
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: address review comments
It also precomputes the input arrow schema
Signed-off-by: evenyag <realevenyag@gmail.com>
---------
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: add FlushRegionsV2 instruction with unified semantics
- Add FlushRegionsV2 struct supporting both single and batch operations
- Preserve original FlushRegion(RegionId) API for backward compatibility
- Support configurable FlushStrategy (Sync/Async) and FlushErrorStrategy (FailFast/TryAll)
- Add detailed per-region error reporting in FlushRegionReply
- Update datanode handlers to support both legacy and enhanced flush instructions
- Maintain zero breaking changes through automatic conversion of legacy formats
Signed-off-by: Alex Araujo <alexaraujo@gmail.com>
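A rough sketch of the shape described in the bullet list above; the field names, assumed defaults, and the `RegionId` alias are illustrative, not the actual definitions:

```rust
use std::collections::HashMap;

type RegionId = u64; // assumption: the real RegionId is a richer type

// Hypothetical sketch of the unified flush instruction.
enum FlushStrategy { Sync, Async }
enum FlushErrorStrategy { FailFast, TryAll }

struct FlushRegionsV2 {
    region_ids: Vec<RegionId>, // single or batch operations
    strategy: FlushStrategy,
    error_strategy: FlushErrorStrategy,
}

// The legacy single-region form converts into the new one automatically,
// which is how old callers keep working without breaking changes.
impl From<RegionId> for FlushRegionsV2 {
    fn from(region_id: RegionId) -> Self {
        Self {
            region_ids: vec![region_id],
            strategy: FlushStrategy::Sync,                 // assumed default
            error_strategy: FlushErrorStrategy::FailFast,  // assumed default
        }
    }
}

// Per-region error reporting in the reply.
struct FlushRegionReply {
    results: HashMap<RegionId, Result<(), String>>,
}
```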
* chore: run make fmt
Signed-off-by: Alex Araujo <alexaraujo@gmail.com>
* Apply suggestions from code review
Co-authored-by: LFC <990479+MichaelScofield@users.noreply.github.com>
Signed-off-by: Alex Araujo <alexaraujo@gmail.com>
* refactor: extract shared perform_region_flush fn
Signed-off-by: Alex Araujo <alexaraujo@gmail.com>
* refactor: use consistent error type across similar methods
see gh copilot suggestion: https://github.com/GreptimeTeam/greptimedb/pull/6819#discussion_r2299603698
Signed-off-by: Alex Araujo <alexaraujo@gmail.com>
* chore: make fmt
Signed-off-by: Alex Araujo <alexaraujo@gmail.com>
* refactor: consolidate FlushRegion instructions
Signed-off-by: Alex Araujo <alexaraujo@gmail.com>
---------
Signed-off-by: Alex Araujo <alexaraujo@gmail.com>
* feat: implements method to write flat batch for ParquetWriter
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: add update method for flat RecordBatch in Indexer
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: calls indexer to write flat batch in ParquetWriter
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: handle empty projection for flat format
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: eval array in precise_filter_flat
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: cache column lookup result in inverted indexer
Signed-off-by: evenyag <realevenyag@gmail.com>
* test: add test
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: support dict type in dense codec
Signed-off-by: evenyag <realevenyag@gmail.com>
* test: remove read part in test as it needs modifying the reader
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: support dictionary type in other methods for dense codec
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: fulltext use string array directly
Signed-off-by: evenyag <realevenyag@gmail.com>
---------
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore/change-encode-raw-values-sig:
### Update Sparse Encoding to Use Byte Slices
- **`bench_sparse_encoding.rs`**: Modified the `encode_raw_tag_value` function to use byte slices instead of `Bytes` for tag values.
- **`sparse.rs`**: Updated the `encode_raw_tag_value` method in `SparsePrimaryKeyCodec` to accept byte slices (`&[u8]`) instead of `Bytes`. Adjusted related test cases to reflect this change.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
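A simplified sketch of the signature change; the codec shown here is a stand-in and the buffer handling is invented for illustration:

```rust
// Hypothetical, heavily simplified version of the codec.
struct SparsePrimaryKeyCodec;

impl SparsePrimaryKeyCodec {
    // Before: the method took an owned `Bytes`, forcing callers to clone or hand over ownership.
    // After: a plain byte slice works for any borrowed buffer.
    fn encode_raw_tag_value(&self, value: &[u8], buf: &mut Vec<u8>) {
        buf.extend_from_slice(value);
    }
}
```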
* chore/change-encode-raw-values-sig:
### Add `Clear` Trait Implementation for Byte Slices
- Implemented the `Clear` trait for byte slices (`&[u8]`) in `repeated_field.rs` to enhance trait coverage and provide a default clear operation for byte slice types.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
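A minimal sketch of what such an impl could look like; the `Clear` trait shape below is assumed, not copied from `repeated_field.rs`:

```rust
// Assumed trait shape, for illustration only.
trait Clear {
    fn clear(&mut self);
}

impl<'a> Clear for &'a [u8] {
    // A byte slice only borrows its data, so "clearing" just resets it to the empty slice.
    fn clear(&mut self) {
        *self = &[];
    }
}
```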
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat: update rate limiter to use semaphore that will block without return error
Signed-off-by: Ning Sun <sunning@greptime.com>
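A minimal sketch of the blocking behavior, assuming a `tokio::sync::Semaphore`; the limiter type and method names are hypothetical:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

// Hypothetical rate limiter: acquiring a permit awaits until one is free
// instead of returning an error when the limit is reached.
struct RateLimiter {
    permits: Arc<Semaphore>,
}

impl RateLimiter {
    fn new(max_concurrent: usize) -> Self {
        Self { permits: Arc::new(Semaphore::new(max_concurrent)) }
    }

    async fn run<F, T>(&self, task: F) -> T
    where
        F: std::future::Future<Output = T>,
    {
        // Waits here until a permit is available rather than failing fast.
        let _permit = self.permits.acquire().await.expect("semaphore is never closed");
        task.await
    }
}
```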
* fix: remove unused error
Signed-off-by: Ning Sun <sunning@greptime.com>
---------
Signed-off-by: Ning Sun <sunning@greptime.com>
* feat: support different key type for the dictionary vector
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: support more dictionary type in try_into_vector
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: use key array's type as key type
Signed-off-by: evenyag <realevenyag@gmail.com>
---------
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: cache the cloned page bytes
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: cache the whole row group pages
The opendal reader may merge IO requests so the pages of different
columns can share the same Bytes.
When we use a per-column page cache, the cache may still reference
the whole Bytes after eviction if other columns in the cache
share the same Bytes.
Signed-off-by: evenyag <realevenyag@gmail.com>
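The sharing problem comes from `bytes::Bytes` slices being reference-counted views over one allocation. A small standalone sketch (not taken from the source) of why caching a single column's page can pin a much larger merged IO buffer:

```rust
use bytes::Bytes;

fn main() {
    // One merged IO read produces a single large buffer...
    let merged_read = Bytes::from(vec![0u8; 8 * 1024 * 1024]);

    // ...and each column's pages are zero-copy slices of that buffer.
    let column_a_page = merged_read.slice(0..4096);
    let column_b_page = merged_read.slice(4096..8192);
    drop(merged_read);
    drop(column_b_page);

    // The remaining 4 KiB slice still keeps the whole 8 MiB allocation alive,
    // which is why caching whole row groups (or copying pages) avoids this leak-like behavior.
    assert_eq!(column_a_page.len(), 4096);
}
```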
* feat: check possible max byte range and copy pages if needed
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: always copy pages
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: returns the copied pages
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: compute cache size by MERGE_GAP
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: align to buf size
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: align to 2MB
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: remove unused code
Signed-off-by: evenyag <realevenyag@gmail.com>
* style: fix clippy
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: fix typo
Signed-off-by: evenyag <realevenyag@gmail.com>
* test: fix parquet read with cache test
Signed-off-by: evenyag <realevenyag@gmail.com>
---------
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: Add support for TWCS time window hints in insert operations
Signed-off-by: WenyXu <wenymedia@gmail.com>
* feat: set system events table time window to 1d
Signed-off-by: WenyXu <wenymedia@gmail.com>
---------
Signed-off-by: WenyXu <wenymedia@gmail.com>
* perf/sparse-encoder:
- **Update Dependencies**: Updated `criterion-plot` to version `0.5.0` and added `criterion` version `0.7.0` in `Cargo.lock`. Added `bytes` to `Cargo.toml` in `src/metric-engine`.
- **Benchmarking**: Added a new benchmark for sparse encoding in `bench_sparse_encoding.rs` and updated `Cargo.toml` in `src/mito-codec` to include `criterion` as a dev-dependency.
- **Sparse Encoding Enhancements**: Modified `SparsePrimaryKeyCodec` in `sparse.rs` to include new methods `encode_raw_tag_value` and `encode_internal`. Added public constants `RESERVED_COLUMN_ID_TSID` and `RESERVED_COLUMN_ID_TABLE_ID`.
- **HTTP Server**: Made `try_decompress` function public in `prom_store.rs`.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* perf/sparse-encoder:
Improve buffer handling in `sparse.rs`
- Refactored buffer reservation logic to use `value_len` for clarity.
- Optimized chunk processing by calculating `num_chunks` and `remainder` for efficient data handling.
- Enhanced manual serialization of bytes to avoid byte-by-byte operations, improving performance.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
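A rough illustration of the chunked-copy idea described above; the 8-byte group size, marker bytes, and padding are assumptions about a memcomparable-style layout, not the actual encoding in `sparse.rs`:

```rust
// Hypothetical: copy whole chunks instead of pushing bytes one at a time.
fn encode_bytes_chunked(value: &[u8], buf: &mut Vec<u8>) {
    const CHUNK: usize = 8;
    let num_chunks = value.len() / CHUNK;
    let remainder = value.len() % CHUNK;

    // Reserve once up front using the known value length.
    buf.reserve(value.len() + num_chunks + if remainder > 0 { CHUNK + 1 } else { 0 });

    for chunk in value.chunks_exact(CHUNK) {
        buf.extend_from_slice(chunk); // bulk copy of a full group
        buf.push(9);                  // assumed "group is full" marker
    }
    if remainder > 0 {
        buf.extend_from_slice(&value[num_chunks * CHUNK..]);
        buf.extend(std::iter::repeat(0u8).take(CHUNK - remainder)); // assumed zero padding
        buf.push(remainder as u8);    // assumed "bytes used in last group" marker
    }
}
```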
* Update src/mito-codec/src/row_converter/sparse.rs
Co-authored-by: Yingwen <realevenyag@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
Co-authored-by: Yingwen <realevenyag@gmail.com>
* test: add failing test for #6791
* test: add support for = and =~
* fix: lint
* fix: code merge issue
Signed-off-by: Ning Sun <sunning@greptime.com>
---------
Signed-off-by: Ning Sun <sunning@greptime.com>
* chore: improve error message when there are more than one time index
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* chore: style
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
---------
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* feat: optimize CreateFlowData with lightweight FlowQueryContext
Replace full QueryContext with FlowQueryContext containing only essential fields (catalog, schema, timezone) in CreateFlowData struct. This improves serialization performance by eliminating unused extensions HashMap and channel fields.
Key changes:
- Add FlowQueryContext struct with conversion implementations
- Update CreateFlowData to use FlowQueryContext with backward compatibility
- Add tests for serialization and conversions
Signed-off-by: Alex Araujo <alexaraujo@gmail.com>
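A minimal sketch of the slimmed-down context and one of its conversions; the field types and the `QueryContext` stand-in below are assumptions:

```rust
// Hypothetical lightweight context carrying only the essential fields.
#[derive(Clone, Debug, PartialEq)]
struct FlowQueryContext {
    catalog: String,
    schema: String,
    timezone: String,
}

// Stand-in for the full context; the real one also carries an extensions
// HashMap, channel information, and more, which are intentionally dropped.
struct QueryContext {
    catalog: String,
    schema: String,
    timezone: String,
}

impl From<&QueryContext> for FlowQueryContext {
    fn from(ctx: &QueryContext) -> Self {
        Self {
            catalog: ctx.catalog.clone(),
            schema: ctx.schema.clone(),
            timezone: ctx.timezone.clone(),
        }
    }
}
```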
* chore: run make fmt
Signed-off-by: Alex Araujo <alexaraujo@gmail.com>
---------
Signed-off-by: Alex Araujo <alexaraujo@gmail.com>
* refactor: use DataFusion's UDAF implementation directly
Signed-off-by: luofucong <luofc@foxmail.com>
* remove: delete how-to guide for writing aggregate functions
Signed-off-by: luofucong <luofc@foxmail.com>
* fix ci
Signed-off-by: luofucong <luofc@foxmail.com>
* refactor: port json_encode_path to datafusion udaf
Signed-off-by: Ning Sun <sunning@greptime.com>
---------
Signed-off-by: luofucong <luofc@foxmail.com>
Signed-off-by: Ning Sun <sunning@greptime.com>
Co-authored-by: Ning Sun <sunning@greptime.com>
* feat: Implements FlatLastNonNull strategy
Dedup rows and keep last non null fields
Signed-off-by: evenyag <realevenyag@gmail.com>
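A simplified, self-contained illustration of the "keep last non-null fields" merge for rows that share the same key; the column layout and types are invented for the example:

```rust
// For duplicated rows, later values win, but a null never overwrites
// an earlier non-null value.
fn merge_last_non_null(existing: &mut [Option<i64>], newer: &[Option<i64>]) {
    for (slot, value) in existing.iter_mut().zip(newer) {
        if value.is_some() {
            *slot = *value;
        }
    }
}

fn main() {
    // Two duplicated rows (same key and timestamp), newest last.
    let mut row = vec![Some(1), None, Some(3)];
    merge_last_non_null(&mut row, &[None, Some(2), None]);
    assert_eq!(row, vec![Some(1), Some(2), Some(3)]);
}
```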
* test: add basic test for FlatLastNonNull
Signed-off-by: evenyag <realevenyag@gmail.com>
* test: port more last non null test
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: do merge rows after delete op
Signed-off-by: evenyag <realevenyag@gmail.com>
* style: fix clippy
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: rename num_pk_columns to field_column_start
So we can support different formats later
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: address comment
Signed-off-by: evenyag <realevenyag@gmail.com>
---------
Signed-off-by: evenyag <realevenyag@gmail.com>
@@ -55,14 +55,18 @@ GreptimeDB uses the [Apache 2.0 license](https://github.com/GreptimeTeam/greptim
- To ensure that the community is free and confident in its ability to use your contributions, please sign the Contributor License Agreement (CLA) which will be incorporated in the pull request process.
- Make sure all files have proper license header (running `docker run --rm -v $(pwd):/github/workspace ghcr.io/korandoru/hawkeye-native:v3 format` from the project root).
- Make sure all your codes are formatted and follow the [coding style](https://pingcap.github.io/style-guide/rust/) and [style guide](docs/style-guide.md).
- Make sure all unit tests are passed using [nextest](https://nexte.st/index.html) `cargo nextest run`.
- Make sure all clippy warnings are fixed (you can check it locally by running `cargo clippy --workspace --all-targets -- -D warnings`).
- Make sure all unit tests are passed using [nextest](https://nexte.st/index.html) `cargo nextest run --workspace --features pg_kvbackend,mysql_kvbackend` or `make test`.
- Make sure all clippy warnings are fixed (you can check it locally by running `cargo clippy --workspace --all-targets -- -D warnings` or `make clippy`).
- Ensure there are no unused dependencies by running `make check-udeps` (clean them up with `make fix-udeps` if reported).
- If you must keep a target-specific dependency (e.g. under `[target.'cfg(...)'.dev-dependencies]`), add a cargo-udeps ignore entry in the same `Cargo.toml`, for example:
`[package.metadata.cargo-udeps.ignore]` with `development = ["rexpect"]` (or `dependencies`/`build` as appropriate).
- When modifying sample configuration files in `config/`, run `make config-docs` (which requires Docker to be installed) to update the configuration documentation and include it in your commit.
#### `pre-commit` Hooks
You could setup the [`pre-commit`](https://pre-commit.com/#plugins) hooks to run these checks on every commit automatically.
1. Install `pre-commit`
pip install pre-commit
@@ -70,7 +74,7 @@ You could setup the [`pre-commit`](https://pre-commit.com/#plugins) hooks to run
**GreptimeDB** is an open-source, cloud-native database purpose-built for the unified collection and analysis of observability data (metrics, logs, and traces). Whether you’re operating on the edge, in the cloud, or across hybrid environments, GreptimeDB empowers real-time insights at massive scale — all in one system.
**GreptimeDB** is an open-source, cloud-native database that unifies metrics, logs, and traces, enabling real-time observability at any scale — across edge, cloud, and hybrid environments.
## Features
| Feature | Description |
| --------- | ----------- |
| [Unified Observability Data](https://docs.greptime.com/user-guide/concepts/why-greptimedb) | Store metrics, logs, and traces as timestamped, contextual wide events. Query via [SQL](https://docs.greptime.com/user-guide/query-data/sql), [PromQL](https://docs.greptime.com/user-guide/query-data/promql), and [streaming](https://docs.greptime.com/user-guide/flow-computation/overview). |
| [High Performance & Cost Effective](https://docs.greptime.com/user-guide/manage-data/data-index) | Written in Rust, with a distributed query engine, [rich indexing](https://docs.greptime.com/user-guide/manage-data/data-index), and optimized columnar storage, delivering sub-second responses at PB scale. |
| [Cloud-Native Architecture](https://docs.greptime.com/user-guide/concepts/architecture) | Designed for [Kubernetes](https://docs.greptime.com/user-guide/deployments-administration/deploy-on-kubernetes/greptimedb-operator-management), with compute/storage separation, native object storage (AWS S3, Azure Blob, etc.) and seamless cross-cloud access. |
| [Developer-Friendly](https://docs.greptime.com/user-guide/protocols/overview) | Access via SQL/PromQL interfaces, REST API, MySQL/PostgreSQL protocols, and popular ingestion [protocols](https://docs.greptime.com/user-guide/protocols/overview). |
| [Flexible Deployment](https://docs.greptime.com/user-guide/deployments-administration/overview) | Deploy anywhere: edge (including ARM/[Android](https://docs.greptime.com/user-guide/deployments-administration/run-on-android)) or cloud, with unified APIs and efficient data sync. |
| [All-in-One Observability](https://docs.greptime.com/user-guide/concepts/why-greptimedb) | OpenTelemetry-native platform unifying metrics, logs, and traces. Query via [SQL](https://docs.greptime.com/user-guide/query-data/sql), [PromQL](https://docs.greptime.com/user-guide/query-data/promql), and [Flow](https://docs.greptime.com/user-guide/flow-computation/overview). |
| [High Performance](https://docs.greptime.com/user-guide/manage-data/data-index) | Written in Rust with [rich indexing](https://docs.greptime.com/user-guide/manage-data/data-index) (inverted, fulltext, skipping, vector), delivering sub-second responses at PB scale. |
| [Cost Efficiency](https://docs.greptime.com/user-guide/concepts/architecture) | 50x lower operational and storage costs with compute-storage separation and native object storage (S3, Azure Blob, etc.). |
| [Cloud-Native & Scalable](https://docs.greptime.com/user-guide/deployments-administration/deploy-on-kubernetes/greptimedb-operator-management) | Purpose-built for [Kubernetes](https://docs.greptime.com/user-guide/deployments-administration/deploy-on-kubernetes/greptimedb-operator-management) with unlimited cross-cloud scaling, handling hundreds of thousands of concurrent requests. |
| [Developer-Friendly](https://docs.greptime.com/user-guide/protocols/overview) | SQL/PromQL interfaces, built-in web dashboard, REST API, MySQL/PostgreSQL protocol compatibility, and native [OpenTelemetry](https://docs.greptime.com/user-guide/ingest-data/for-observability/opentelemetry/) support. |
| [Flexible Deployment](https://docs.greptime.com/user-guide/deployments-administration/overview) | Deploy anywhere from ARM-based edge devices (including [Android](https://docs.greptime.com/user-guide/deployments-administration/run-on-android)) to cloud, with unified APIs and efficient data sync. |
✅ **Perfect for:**
- Unified observability stack replacing Prometheus + Loki + Tempo
- Large-scale metrics with high cardinality (millions to billions of time series)
- Large-scale observability platform requiring cost efficiency and scalability
- IoT and edge computing with resource and bandwidth constraints
Learn more in [Why GreptimeDB](https://docs.greptime.com/user-guide/concepts/why-greptimedb) and [Observability 2.0 and the Database for It](https://greptime.com/blogs/2025-04-25-greptimedb-observability2-new-database).
@@ -86,10 +92,10 @@ Learn more in [Why GreptimeDB](https://docs.greptime.com/user-guide/concepts/why
* Read the [architecture](https://docs.greptime.com/contributor-guide/overview/#architecture) document.
* [DeepWiki](https://deepwiki.com/GreptimeTeam/greptimedb/1-overview) provides an in-depth look at GreptimeDB:
GreptimeDB can run in two modes:
* **Standalone Mode** - Single binary for development and small deployments
* **Distributed Mode** - Separate components for production scale:
- Frontend: Query processing and protocol handling
- Datanode: Data storage and retrieval
- Metasrv: Metadata management and coordination
Read the [architecture](https://docs.greptime.com/contributor-guide/overview/#architecture) document. [DeepWiki](https://deepwiki.com/GreptimeTeam/greptimedb/1-overview) provides an in-depth look at GreptimeDB:
<img alt="GreptimeDB System Overview" src="docs/architecture.png">
- **Grafana Data Source**: [GreptimeDB Grafana data source plugin](https://github.com/GreptimeTeam/greptimedb-grafana-datasource)
- **Grafana Dashboard**: [Official Dashboard for monitoring](https://github.com/GreptimeTeam/greptimedb/blob/main/grafana/README.md)
## Project Status
> **Status:** Beta.
> **GA (v1.0):** Targeted for mid 2025.
> **Status:** Beta — marching toward v1.0 GA!
> **GA (v1.0):** January 10, 2026
- Being used in production by early adopters
- Deployed in production by open-source projects and commercial users
- Stable, actively maintained, with regular releases ([version info](https://docs.greptime.com/nightly/reference/about-greptimedb-version))
- Suitable for evaluation and pilot deployments
GreptimeDB v1.0 represents a major milestone toward maturity — marking stable APIs, production readiness, and proven performance.
**Roadmap:** Beta1 (Nov 10) → Beta2 (Nov 24) → RC1 (Dec 8) → GA (Jan 10, 2026); please read [v1.0 highlights and release plan](https://greptime.com/blogs/2025-11-05-greptimedb-v1-highlights) for details.
For production use, we recommend using the latest stable release.
[](https://www.star-history.com/#GreptimeTeam/GreptimeDB&Date)
@@ -214,5 +222,5 @@ Special thanks to all contributors! See [AUTHORS.md](https://github.com/Greptime
| `default_timezone` | String | Unset | The default timezone of the server. |
| `default_column_prefix` | String | Unset | The default column prefix for auto-created time index and value columns. |
| `init_regions_in_background` | Bool | `false` | Initialize all regions in the background during the startup.<br/>By default, it provides services after all regions have been initialized. |
| `max_concurrent_queries` | Integer | `0` | The maximum number of concurrent queries allowed to be executed. Zero means unlimited. |
| `max_concurrent_queries` | Integer | `0` | The maximum number of concurrent queries allowed to be executed. Zero means unlimited.<br/>NOTE: This setting affects scan_memory_limit's privileged tier allocation.<br/>When set, 70% of queries get privileged memory access (full scan_memory_limit).<br/>The remaining 30% get standard tier access (70% of scan_memory_limit). |
| `enable_telemetry` | Bool | `true` | Enable telemetry to collect anonymous usage data. Enabled by default. |
| `max_in_flight_write_bytes` | String | Unset | The maximum in-flight write bytes. |
| `runtime` | -- | -- | The runtime options. |
@@ -25,12 +26,15 @@
| `http.addr` | String | `127.0.0.1:4000` | The address to bind the HTTP server. |
| `http.timeout` | String | `0s` | HTTP request timeout. Set to 0 to disable timeout. |
| `http.body_limit` | String | `64MB` | HTTP request body limit.<br/>The following units are supported: `B`, `KB`, `KiB`, `MB`, `MiB`, `GB`, `GiB`, `TB`, `TiB`, `PB`, `PiB`.<br/>Set to 0 to disable limit. |
| `http.max_total_body_memory` | String | Unset | Maximum total memory for all concurrent HTTP request bodies.<br/>Set to 0 to disable the limit. Default: "0" (unlimited) |
| `http.enable_cors` | Bool | `true` | HTTP CORS support, it's turned on by default<br/>This allows browser to access http APIs without CORS restrictions |
| `http.prom_validation_mode` | String | `strict` | Whether to enable validation for Prometheus remote write requests.<br/>Available options:<br/>- strict: deny invalid UTF-8 strings (default).<br/>- lossy: allow invalid UTF-8 strings, replace invalid characters with REPLACEMENT_CHARACTER(U+FFFD).<br/>- unchecked: do not validate strings. |
| `grpc` | -- | -- | The gRPC server options. |
| `grpc.bind_addr` | String | `127.0.0.1:4001` | The address to bind the gRPC server. |
| `grpc.runtime_size` | Integer | `8` | The number of server worker threads. |
| `grpc.max_total_message_memory` | String | Unset | Maximum total memory for all concurrent gRPC request messages.<br/>Set to 0 to disable the limit. Default: "0" (unlimited) |
| `grpc.max_connection_age` | String | Unset | The maximum connection age for gRPC connection.<br/>The value can be a human-readable time string. For example: `10m` for ten minutes or `1h` for one hour.<br/>Refer to https://grpc.io/docs/guides/keepalive/ for more details. |
| `grpc.tls` | -- | -- | gRPC server TLS options, see `mysql.tls` section. |
| `wal.broker_endpoints` | Array | -- | The Kafka broker endpoints.<br/>**It's only used when the provider is `kafka`**. |
| `wal.connect_timeout` | String | `3s` | The connect timeout for kafka client.<br/>**It's only used when the provider is `kafka`**. |
| `wal.timeout` | String | `3s` | The timeout for kafka client.<br/>**It's only used when the provider is `kafka`**. |
| `wal.auto_create_topics` | Bool | `true` | Automatically create topics for WAL.<br/>Set to `true` to automatically create topics for WAL.<br/>Otherwise, use topics named `topic_name_prefix_[0..num_topics)` |
| `wal.num_topics` | Integer | `64` | Number of topics.<br/>**It's only used when the provider is `kafka`**. |
| `wal.selector_type` | String | `round_robin` | Topic selector type.<br/>Available selector types:<br/>- `round_robin` (default)<br/>**It's only used when the provider is `kafka`**. |
@@ -99,11 +106,10 @@
| `flow.num_workers` | Integer | `0` | The number of flow workers in flownode.<br/>Not setting this value (or setting it to 0) will use the number of CPU cores divided by 2. |
| `query` | -- | -- | The query engine options. |
| `query.parallelism` | Integer | `0` | Parallelism of the query engine.<br/>Default to 0, which means the number of CPU cores. |
| `query.memory_pool_size` | String | `50%` | Memory pool size for query execution operators (aggregation, sorting, join).<br/>Supports absolute size (e.g., "2GB", "4GB") or percentage of system memory (e.g., "20%").<br/>Setting it to 0 disables the limit (unbounded, default behavior).<br/>When this limit is reached, queries will fail with ResourceExhausted error.<br/>NOTE: This does NOT limit memory used by table scans. |
| `storage` | -- | -- | The data storage options. |
| `storage.data_home` | String | `./greptimedb_data` | The working home directory. |
| `storage.type` | String | `File` | The storage type used to store the data.<br/>- `File`: the data is stored in the local file system.<br/>- `S3`: the data is stored in the S3 object storage.<br/>- `Gcs`: the data is stored in the Google Cloud Storage.<br/>- `Azblob`: the data is stored in the Azure Blob Storage.<br/>- `Oss`: the data is stored in the Aliyun OSS. |
| `storage.cache_path` | String | Unset | Read cache configuration for object storage such as 'S3' etc. It's configured by default when using object storage, and it is recommended to configure it for better performance.<br/>A local file directory, defaults to `{data_home}`. An empty string means disabling. |
| `storage.cache_capacity` | String | Unset | The local file cache capacity in bytes. If your disk space is sufficient, it is recommended to set it larger. |
| `storage.bucket` | String | Unset | The S3 bucket name.<br/>**It's only used when the storage type is `S3`, `Oss` and `Gcs`**. |
| `storage.root` | String | Unset | The S3 data will be stored in the specified prefix, for example, `s3://${bucket}/${root}`.<br/>**It's only used when the storage type is `S3`, `Oss` and `Azblob`**. |
| `storage.access_key_id` | String | Unset | The access key id of the aws account.<br/>It's **highly recommended** to use AWS IAM roles instead of hardcoding the access key id and secret key.<br/>**It's only used when the storage type is `S3` and `Oss`**. |
@@ -134,6 +140,8 @@
| `region_engine.mito.max_background_flushes` | Integer | Auto | Max number of running background flush jobs (default: 1/2 of cpu cores). |
| `region_engine.mito.max_background_compactions` | Integer | Auto | Max number of running background compaction jobs (default: 1/4 of cpu cores). |
| `region_engine.mito.max_background_purges` | Integer | Auto | Max number of running background purge jobs (default: number of cpu cores). |
| `region_engine.mito.experimental_compaction_memory_limit` | String | 0 | Memory budget for compaction tasks. Setting it to 0 or "unlimited" disables the limit. |
| `region_engine.mito.experimental_compaction_on_exhausted` | String | wait | Behavior when compaction cannot acquire memory from the budget.<br/>Options: "wait" (default, 10s), "wait(<duration>)", "fail" |
| `region_engine.mito.auto_flush_interval` | String | `1h` | Interval to auto flush a region if it has not flushed yet. |
| `region_engine.mito.global_write_buffer_size` | String | Auto | Global write buffer size for all regions. If not set, it's default to 1/8 of OS memory with a max limitation of 1GB. |
| `region_engine.mito.global_write_buffer_reject_size` | String | Auto | Global write buffer size threshold to reject write requests. If not set, it's default to 2 times of `global_write_buffer_size`. |
@@ -145,11 +153,17 @@
| `region_engine.mito.write_cache_path` | String | `""` | File system path for write cache, defaults to `{data_home}`. |
| `region_engine.mito.write_cache_size` | String | `5GiB` | Capacity for write cache. If your disk space is sufficient, it is recommended to set it larger. |
| `region_engine.mito.preload_index_cache` | Bool | `true` | Preload index (puffin) files into cache on region open (default: true).<br/>When enabled, index files are loaded into the write cache during region initialization,<br/>which can improve query performance at the cost of longer startup times. |
| `region_engine.mito.index_cache_percent` | Integer | `20` | Percentage of write cache capacity allocated for index (puffin) files (default: 20).<br/>The remaining capacity is used for data (parquet) files.<br/>Must be between 0 and 100 (exclusive). For example, with a 5GiB write cache and 20% allocation,<br/>1GiB is reserved for index files and 4GiB for data files. |
| `region_engine.mito.parallel_scan_channel_size` | Integer | `32` | Capacity of the channel to send data from parallel scan tasks to the main task. |
| `region_engine.mito.max_concurrent_scan_files` | Integer | `128` | Maximum number of SST files to scan concurrently. |
| `region_engine.mito.max_concurrent_scan_files` | Integer | `384` | Maximum number of SST files to scan concurrently. |
| `region_engine.mito.allow_stale_entries` | Bool | `false` | Whether to allow stale WAL entries read during replay. |
| `region_engine.mito.scan_memory_limit` | String | `50%` | Memory limit for table scans across all queries.<br/>Supports absolute size (e.g., "2GB") or percentage of system memory (e.g., "20%").<br/>Setting it to 0 disables the limit.<br/>NOTE: Works with max_concurrent_queries for tiered memory allocation.<br/>- If max_concurrent_queries is set: 70% of queries get full access, 30% get 70% access.<br/>- If max_concurrent_queries is 0 (unlimited): first 20 queries get full access, rest get 70% access. |
| `region_engine.mito.min_compaction_interval` | String | `0m` | Minimum time interval between two compactions.<br/>To align with the old behavior, the default value is 0 (no restrictions). |
| `region_engine.mito.default_experimental_flat_format` | Bool | `false` | Whether to enable experimental flat format as the default format. |
| `region_engine.mito.index` | -- | -- | The options for index in Mito engine. |
| `region_engine.mito.index.aux_path` | String | `""` | Auxiliary directory path for the index in filesystem, used to store intermediate files for<br/>creating the index and staging files for searching the index, defaults to `{data_home}/index_intermediate`.<br/>The default name for this directory is `index_intermediate` for backward compatibility.<br/><br/>This path contains two subdirectories:<br/>- `__intm`: for storing intermediate files used during creating index.<br/>- `staging`: for storing staging files used during searching index. |
| `region_engine.mito.index.staging_size` | String | `2GB` | The max capacity of the staging directory. |
@@ -181,31 +195,24 @@
| `region_engine.mito.memtable.fork_dictionary_bytes` | String | `1GiB` | Max dictionary bytes.<br/>Only available for `partition_tree` memtable. |
| `logging.append_stdout` | Bool | `true` | Whether to append logs to stdout. |
| `logging.log_format` | String | `text` | The log format. Can be `text`/`json`. |
| `logging.max_log_files` | Integer | `720` | The maximum amount of log files. |
| `logging.otlp_export_protocol` | String | `http` | The OTLP tracing export protocol. Can be `grpc`/`http`. |
| `logging.tracing_sample_ratio` | -- | -- | The percentage of tracing will be sampled and exported.<br/>Valid range `[0, 1]`, 1 means all traces are sampled, 0 means all traces are not sampled, the default value is 1.<br/>ratio > 1 are treated as 1. Fractions < 0 are treated as 0. |
| `logging.otlp_headers` | -- | -- | Additional OTLP headers, only valid when using OTLP http |
| `logging.tracing_sample_ratio` | -- | Unset | The percentage of tracing will be sampled and exported.<br/>Valid range `[0, 1]`, 1 means all traces are sampled, 0 means all traces are not sampled, the default value is 1.<br/>ratio > 1 are treated as 1. Fractions < 0 are treated as 0. |
| `export_metrics` | -- | -- | The standalone can export its metrics and send to Prometheus compatible service (e.g. `greptimedb`) from remote-write API.<br/>This is only used for `greptimedb` to export its own metrics internally. It's different from prometheus scrape. |
| `export_metrics.write_interval` | String | `30s` | The interval of export metrics. |
| `export_metrics.self_import` | -- | -- | For `standalone` mode, `self_import` is recommended to collect metrics generated by itself<br/>You must create the database before enabling it. |
| `export_metrics.remote_write.url` | String | `""` | The prometheus remote write endpoint that the metrics send to. The url example can be: `http://127.0.0.1:4000/v1/prometheus/write?db=greptime_metrics`. |
| `default_timezone` | String | Unset | The default timezone of the server. |
| `default_column_prefix` | String | Unset | The default column prefix for auto-created time index and value columns. |
| `max_in_flight_write_bytes` | String | Unset | The maximum in-flight write bytes. |
| `runtime` | -- | -- | The runtime options. |
| `runtime.global_rt_size` | Integer | `8` | The number of threads to execute the runtime for global read operations. |
@@ -230,6 +238,7 @@
| `http.addr` | String | `127.0.0.1:4000` | The address to bind the HTTP server. |
| `http.timeout` | String | `0s` | HTTP request timeout. Set to 0 to disable timeout. |
| `http.body_limit` | String | `64MB` | HTTP request body limit.<br/>The following units are supported: `B`, `KB`, `KiB`, `MB`, `MiB`, `GB`, `GiB`, `TB`, `TiB`, `PB`, `PiB`.<br/>Set to 0 to disable limit. |
| `http.max_total_body_memory` | String | Unset | Maximum total memory for all concurrent HTTP request bodies.<br/>Set to 0 to disable the limit. Default: "0" (unlimited) |
| `http.enable_cors` | Bool | `true` | HTTP CORS support, it's turned on by default<br/>This allows browser to access http APIs without CORS restrictions |
| `http.prom_validation_mode` | String | `strict` | Whether to enable validation for Prometheus remote write requests.<br/>Available options:<br/>- strict: deny invalid UTF-8 strings (default).<br/>- lossy: allow invalid UTF-8 strings, replace invalid characters with REPLACEMENT_CHARACTER(U+FFFD).<br/>- unchecked: do not validate strings. |
@@ -237,17 +246,30 @@
| `grpc.bind_addr` | String | `127.0.0.1:4001` | The address to bind the gRPC server. |
| `grpc.server_addr` | String | `127.0.0.1:4001` | The address advertised to the metasrv, and used for connections from outside the host.<br/>If left empty or unset, the server will automatically use the IP address of the first network interface<br/>on the host, with the same port number as the one specified in `grpc.bind_addr`. |
| `grpc.runtime_size` | Integer | `8` | The number of server worker threads. |
| `grpc.max_total_message_memory` | String | Unset | Maximum total memory for all concurrent gRPC request messages.<br/>Set to 0 to disable the limit. Default: "0" (unlimited) |
| `grpc.flight_compression` | String | `arrow_ipc` | Compression mode for frontend side Arrow IPC service. Available options:<br/>- `none`: disable all compression<br/>- `transport`: only enable gRPC transport compression (zstd)<br/>- `arrow_ipc`: only enable Arrow IPC compression (lz4)<br/>- `all`: enable all compression.<br/>Default to `none` |
| `grpc.max_connection_age` | String | Unset | The maximum connection age for gRPC connection.<br/>The value can be a human-readable time string. For example: `10m` for ten minutes or `1h` for one hour.<br/>Refer to https://grpc.io/docs/guides/keepalive/ for more details. |
| `grpc.tls` | -- | -- | gRPC server TLS options, see `mysql.tls` section. |
| `grpc.tls.watch` | Bool | `false` | Watch for Certificate and key file change and auto reload.<br/>For now, gRPC tls config does not support auto reload. |
| `internal_grpc` | -- | -- | The internal gRPC server options. Internal gRPC port for nodes inside cluster to access frontend. |
| `internal_grpc.bind_addr` | String | `127.0.0.1:4010` | The address to bind the gRPC server. |
| `internal_grpc.server_addr` | String | `127.0.0.1:4010` | The address advertised to the metasrv, and used for connections from outside the host.<br/>If left empty or unset, the server will automatically use the IP address of the first network interface<br/>on the host, with the same port number as the one specified in `grpc.bind_addr`. |
| `internal_grpc.runtime_size` | Integer | `8` | The number of server worker threads. |
| `internal_grpc.flight_compression` | String | `arrow_ipc` | Compression mode for frontend side Arrow IPC service. Available options:<br/>- `none`: disable all compression<br/>- `transport`: only enable gRPC transport compression (zstd)<br/>- `arrow_ipc`: only enable Arrow IPC compression (lz4)<br/>- `all`: enable all compression.<br/>Default to `none` |
| `internal_grpc.tls` | -- | -- | internal gRPC server TLS options, see `mysql.tls` section. |
| `internal_grpc.tls.watch` | Bool | `false` | Watch for Certificate and key file change and auto reload.<br/>For now, gRPC tls config does not support auto reload. |
| `query.parallelism` | Integer | `0` | Parallelism of the query engine.<br/>Default to 0, which means the number of CPU cores. |
| `query.allow_query_fallback` | Bool | `false` | Whether to allow query fallback when push-down optimization fails.<br/>Defaults to false, meaning that when push-down optimization fails, an error message is returned. |
| `query.memory_pool_size` | String | `50%` | Memory pool size for query execution operators (aggregation, sorting, join).<br/>Supports absolute size (e.g., "4GB", "8GB") or percentage of system memory (e.g., "30%").<br/>Setting it to 0 disables the limit (unbounded, default behavior).<br/>When this limit is reached, queries will fail with ResourceExhausted error.<br/>NOTE: This does NOT limit memory used by table scans (only applies to datanodes). |
| `logging.append_stdout` | Bool | `true` | Whether to append logs to stdout. |
| `logging.log_format` | String | `text` | The log format. Can be `text`/`json`. |
| `logging.max_log_files` | Integer | `720` | The maximum amount of log files. |
| `logging.otlp_export_protocol` | String | `http` | The OTLP tracing export protocol. Can be `grpc`/`http`. |
| `logging.tracing_sample_ratio` | -- | -- | The percentage of tracing will be sampled and exported.<br/>Valid range `[0, 1]`, 1 means all traces are sampled, 0 means all traces are not sampled, the default value is 1.<br/>ratio > 1 are treated as 1. Fractions < 0 are treated as 0. |
| `logging.otlp_headers` | -- | -- | Additional OTLP headers, only valid when using OTLP http |
| `logging.tracing_sample_ratio` | -- | Unset | The percentage of tracing will be sampled and exported.<br/>Valid range `[0, 1]`, 1 means all traces are sampled, 0 means all traces are not sampled, the default value is 1.<br/>ratio > 1 are treated as 1. Fractions < 0 are treated as 0. |
| `slow_query.threshold` | String | `30s` | The threshold of slow query. It can be human readable time string, for example: `10s`, `100ms`, `1s`. |
| `slow_query.sample_ratio` | Float | `1.0` | The sampling ratio of slow query log. The value should be in the range of (0, 1]. For example, `0.1` means 10% of the slow queries will be logged and `1.0` means all slow queries will be logged. |
| `slow_query.ttl` | String | `90d` | The TTL of the `slow_queries` system table. Default is `90d` when `record_type` is `system_table`. |
| `export_metrics` | -- | -- | The frontend can export its metrics and send to Prometheus compatible service (e.g. `greptimedb` itself) from remote-write API.<br/>This is only used for `greptimedb` to export its own metrics internally. It's different from prometheus scrape. |
| `export_metrics.write_interval` | String | `30s` | The interval of export metrics. |
| `export_metrics.remote_write` | -- | -- | -- |
| `export_metrics.remote_write.url` | String | `""` | The prometheus remote write endpoint that the metrics send to. The url example can be: `http://127.0.0.1:4000/v1/prometheus/write?db=greptime_metrics`. |
| `tracing` | -- | -- | The tracing options. Only effect when compiled with `tokio-console` feature. |
| `tracing.tokio_console_addr` | String | Unset | The tokio console address. |
| `memory` | -- | -- | The memory options. |
@@ -325,10 +342,11 @@
| Key | Type | Default | Descriptions |
| --- | -----| ------- | ----------- |
| `data_home` | String | `./greptimedb_data` | The working home directory. |
| `store_addrs` | Array | -- | Store server address default to etcd store.<br/>For postgres store, the format is:<br/>"password=password dbname=postgres user=postgres host=localhost port=5432"<br/>For etcd store, the format is:<br/>"127.0.0.1:2379" |
| `store_addrs` | Array | -- | Store server address(es). The format depends on the selected backend.<br/><br/>For etcd: a list of "host:port" endpoints.<br/>e.g. ["192.168.1.1:2379", "192.168.1.2:2379"]<br/><br/>For PostgreSQL: a connection string in libpq format or URI.<br/>e.g.<br/>- "host=localhost port=5432 user=postgres password=<PASSWORD> dbname=postgres"<br/>- "postgresql://user:password@localhost:5432/mydb?connect_timeout=10"<br/>For details, see: https://docs.rs/tokio-postgres/latest/tokio_postgres/config/struct.Config.html<br/><br/>For MySQL: a connection URL.<br/>e.g. "mysql://user:password@localhost:3306/greptime_meta?ssl-mode=VERIFY_CA&ssl-ca=/path/to/ca.pem" |
| `store_key_prefix` | String | `""` | If it's not empty, the metasrv will store all data with this key prefix. |
| `backend` | String | `etcd_store` | The datastore for meta server.<br/>Available values:<br/>- `etcd_store` (default value)<br/>- `memory_store`<br/>- `postgres_store`<br/>- `mysql_store` |
| `meta_table_name` | String | `greptime_metakv` | Table name in RDS to store metadata. Effective when using an RDS kvbackend.<br/>**Only used when backend is `postgres_store`.** |
| `meta_schema_name` | String | `greptime_schema` | Optional PostgreSQL schema for metadata table and election table name qualification.<br/>When PostgreSQL public schema is not writable (e.g., PostgreSQL 15+ with restricted public),<br/>set this to a writable schema. GreptimeDB will use `meta_schema_name`.`meta_table_name`.<br/>GreptimeDB will NOT create the schema automatically; please ensure it exists or the user has permission.<br/>**Only used when backend is `postgres_store`.** |
| `meta_election_lock_id` | Integer | `1` | Advisory lock id in PostgreSQL for election. Effective when using PostgreSQL as kvbackend.<br/>Only used when backend is `postgres_store`. |
| `use_memory_store` | Bool | `false` | Store data in memory. |
@@ -336,22 +354,28 @@
| `region_failure_detector_initialization_delay` | String | `10m` | The delay before starting region failure detection.<br/>This delay helps prevent Metasrv from triggering unnecessary region failovers before all Datanodes are fully started.<br/>Especially useful when the cluster is not deployed with GreptimeDB Operator and maintenance mode is not enabled. |
| `allow_region_failover_on_local_wal` | Bool | `false` | Whether to allow region failover on local WAL.<br/>**This option is not recommended to be set to true, because it may lead to data loss during failover.** |
| `node_max_idle_time` | String | `24hours` | Max allowed idle time before removing node info from metasrv memory. |
| `heartbeat_interval` | String | `3s` | Base heartbeat interval for calculating distributed time constants.<br/>The frontend heartbeat interval is 6 times the base heartbeat interval.<br/>The flownode/datanode heartbeat interval equals the base heartbeat interval.<br/>e.g., If the base heartbeat interval is 3s, the frontend heartbeat interval is 18s, the flownode/datanode heartbeat interval is 3s.<br/>If you change this value, you need to change the heartbeat interval of the flownode/frontend/datanode accordingly. |
| `enable_telemetry` | Bool | `true` | Whether to enable greptimedb telemetry. Enabled by default. |
| `runtime` | -- | -- | The runtime options. |
| `runtime.global_rt_size` | Integer | `8` | The number of threads to execute the runtime for global read operations. |
| `runtime.compact_rt_size` | Integer | `4` | The number of threads to execute the runtime for global write operations. |
| `backend_tls` | -- | -- | TLS configuration for kv store backend (only applicable for PostgreSQL/MySQL backends)<br/>When using PostgreSQL or MySQL as metadata store, you can configure TLS here |
| `backend_tls` | -- | -- | TLS configuration for kv store backend (applicable for etcd, PostgreSQL, and MySQL backends)<br/>When using etcd, PostgreSQL, or MySQL as metadata store, you can configure TLS here<br/><br/>Note: if TLS is configured in both this section and the `store_addrs` connection string, the<br/>settings here will override the TLS settings in `store_addrs`. |
| `backend_tls.mode` | String | `prefer` | TLS mode, refer to https://www.postgresql.org/docs/current/libpq-ssl.html<br/>- "disable" - No TLS<br/>- "prefer" (default) - Try TLS, fallback to plain<br/>- "require" - Require TLS<br/>- "verify_ca" - Require TLS and verify CA<br/>- "verify_full" - Require TLS and verify hostname |
| `backend_tls.ca_cert_path` | String | `""` | Path to CA certificate file (for server certificate verification)<br/>Required when using custom CAs or self-signed certificates<br/>Leave empty to use system root certificates only<br/>Like "/path/to/ca.crt" |
| `backend_tls.watch` | Bool | `false` | Watch for certificate file changes and auto reload |
| `backend_client` | -- | -- | The backend client options.<br/>Currently, only applicable when using etcd as the metadata store. |
| `backend_client.keep_alive_timeout` | String | `3s` | The keep alive timeout for backend client. |
| `backend_client.keep_alive_interval` | String | `10s` | The keep alive interval for backend client. |
| `backend_client.connect_timeout` | String | `3s` | The connect timeout for backend client. |
| `grpc` | -- | -- | The gRPC server options. |
| `grpc.bind_addr` | String | `127.0.0.1:3002` | The address to bind the gRPC server. |
| `grpc.server_addr` | String | `127.0.0.1:3002` | The communication server address for the frontend and datanode to connect to metasrv.<br/>If left empty or unset, the server will automatically use the IP address of the first network interface<br/>on the host, with the same port number as the one specified in `bind_addr`. |
| `grpc.runtime_size` | Integer | `8` | The number of server worker threads. |
| `grpc.max_recv_message_size` | String | `512MB` | The maximum receive message size for gRPC server. |
| `grpc.max_send_message_size` | String | `512MB` | The maximum send message size for gRPC server. |
| `grpc.http2_keep_alive_interval` | String | `10s` | The server side HTTP/2 keep-alive interval |
| `grpc.http2_keep_alive_timeout` | String | `3s` | The server side HTTP/2 keep-alive timeout. |
| `http` | -- | -- | The HTTP server options. |
| `http.addr` | String | `127.0.0.1:4000` | The address to bind the HTTP server. |
| `http.timeout` | String | `0s` | HTTP request timeout. Set to 0 to disable timeout. |
@@ -362,10 +386,9 @@
| `procedure.max_metadata_value_size` | String | `1500KiB` | Auto split large value<br/>GreptimeDB procedure uses etcd as the default metadata storage backend.<br/>In etcd, the maximum size of any request is 1.5 MiB.<br/>1500KiB = 1536KiB (1.5MiB) - 36KiB (reserved size of key)<br/>Comment out `max_metadata_value_size` to avoid splitting large values (no limit). |
| `procedure.max_running_procedures` | Integer | `128` | Max running procedures.<br/>The maximum number of procedures that can be running at the same time.<br/>If the number of running procedures exceeds this limit, the procedure will be rejected. |
| `failure_detector` | -- | -- | -- |
| `failure_detector.threshold` | Float | `8.0` | The threshold value used by the failure detector to determine failure conditions. |
| `failure_detector.min_std_deviation` | String | `100ms` | The minimum standard deviation of the heartbeat intervals, used to calculate acceptable variations. |
| `failure_detector.acceptable_heartbeat_pause` | String | `10000ms` | The acceptable pause duration between heartbeats, used to determine if a heartbeat interval is acceptable. |
| `failure_detector.first_heartbeat_estimate` | String | `1000ms` | The initial estimate of the heartbeat interval used by the failure detector. |
| `failure_detector.threshold` | Float | `8.0` | Maximum acceptable φ before the peer is treated as failed.<br/>Lower values react faster but yield more false positives. |
| `failure_detector.min_std_deviation` | String | `100ms` | The minimum standard deviation of the heartbeat intervals.<br/>So tiny variations don’t make φ explode. Prevents hypersensitivity when heartbeat intervals barely vary. |
| `failure_detector.acceptable_heartbeat_pause` | String | `10000ms` | The acceptable pause duration between heartbeats.<br/>Additional extra grace period to the learned mean interval before φ rises, absorbing temporary network hiccups or GC pauses. |
| `wal.broker_endpoints` | Array | -- | The broker endpoints of the Kafka cluster. |
| `wal.auto_create_topics` | Bool | `true` | Automatically create topics for WAL.<br/>Set to `true` to automatically create topics for WAL.<br/>Otherwise, use topics named `topic_name_prefix_[0..num_topics)` |
| `wal.auto_prune_interval` | String | `0s` | Interval of automatic WAL pruning.<br/>Set to `0s` to disable automatic WAL pruning, which deletes unused remote WAL entries periodically. |
| `wal.trigger_flush_threshold` | Integer | `0` | The threshold to trigger a flush operation of a region in automatic WAL pruning.<br/>Metasrv will send a flush request to flush the region when:<br/>`trigger_flush_threshold` + `prunable_entry_id` < `max_prunable_entry_id`<br/>where:<br/>- `prunable_entry_id` is the maximum entry id that can be pruned of the region.<br/>- `max_prunable_entry_id` is the maximum prunable entry id among all regions in the same topic.<br/>Set to `0` to disable the flush operation. |
| `wal.topic_name_prefix` | String | `greptimedb_wal_topic` | A Kafka topic is constructed by concatenating `topic_name_prefix` and `topic_id`.<br/>Only accepts strings that match the following regular expression pattern:<br/>[a-zA-Z_:-][a-zA-Z0-9_:\-\.@#]*<br/>e.g., greptimedb_wal_topic_0, greptimedb_wal_topic_1. |
| `wal.replication_factor` | Integer | `1` | Expected number of replicas of each partition. |
| `wal.create_topic_timeout` | String | `30s` | Above which a topic creation operation will be cancelled. |
| `wal.broker_endpoints` | Array | -- | The broker endpoints of the Kafka cluster.<br/><br/>**It's only used when the provider is `kafka`**. |
| `wal.auto_create_topics` | Bool | `true` | Automatically create topics for WAL.<br/>Set to `true` to automatically create topics for WAL.<br/>Otherwise, use topics named `topic_name_prefix_[0..num_topics)`<br/>**It's only used when the provider is `kafka`**. |
| `wal.auto_prune_interval` | String | `30m` | Interval of automatic WAL pruning.<br/>Set to `0s` to disable automatic WAL pruning, which deletes unused remote WAL entries periodically.<br/>**It's only used when the provider is `kafka`**. |
| `wal.flush_trigger_size` | String | `512MB` | Estimated size threshold to trigger a flush when using Kafka remote WAL.<br/>Since multiple regions may share a Kafka topic, the estimated size is calculated as:<br/> (latest_entry_id - flushed_entry_id) * avg_record_size<br/>MetaSrv triggers a flush for a region when this estimated size exceeds `flush_trigger_size`.<br/>- `latest_entry_id`: The latest entry ID in the topic.<br/>- `flushed_entry_id`: The last flushed entry ID for the region.<br/>Set to "0" to let the system decide the flush trigger size.<br/>**It's only used when the provider is `kafka`**. |
| `wal.checkpoint_trigger_size` | String | `128MB` | Estimated size threshold to trigger a checkpoint when using Kafka remote WAL.<br/>The estimated size is calculated as:<br/> (latest_entry_id - last_checkpoint_entry_id) * avg_record_size<br/>MetaSrv triggers a checkpoint for a region when this estimated size exceeds `checkpoint_trigger_size`.<br/>Set to "0" to let the system decide the checkpoint trigger size.<br/>**It's only used when the provider is `kafka`**. |
| `wal.auto_prune_parallelism` | Integer | `10` | Concurrent task limit for automatic WAL pruning.<br/>**It's only used when the provider is `kafka`**. |
| `wal.num_topics` | Integer | `64` | Number of topics used for remote WAL.<br/>**It's only used when the provider is `kafka`**. |
| `wal.selector_type` | String | `round_robin` | Topic selector type.<br/>Available selector types:<br/>- `round_robin` (default)<br/>**It's only used when the provider is `kafka`**. |
| `wal.topic_name_prefix` | String | `greptimedb_wal_topic` | A Kafka topic is constructed by concatenating `topic_name_prefix` and `topic_id`.<br/>Only accepts strings that match the following regular expression pattern:<br/>[a-zA-Z_:-][a-zA-Z0-9_:\-\.@#]*<br/>e.g., greptimedb_wal_topic_0, greptimedb_wal_topic_1.<br/>**It's only used when the provider is `kafka`**. |
| `wal.replication_factor` | Integer | `1` | Expected number of replicas of each partition.<br/>**It's only used when the provider is `kafka`**. |
| `wal.create_topic_timeout` | String | `30s` | The timeout for creating a Kafka topic.<br/>**It's only used when the provider is `kafka`**. |
| `event_recorder` | -- | -- | Configuration options for the event recorder. |
| `event_recorder.ttl` | String | `90d` | TTL for the events table that will be used to store the events. Default is `90d`. |
| `stats_persistence` | -- | -- | Configuration options for the stats persistence. |
| `stats_persistence.ttl` | String | `0s` | TTL for the stats table that will be used to store the stats.<br/>Set to `0s` to disable stats persistence.<br/>Default is `0s`.<br/>If you want to enable stats persistence, set the TTL to a value greater than 0.<br/>It is recommended to set a small value, e.g., `3h`. |
| `stats_persistence.interval` | String | `10m` | The interval to persist the stats. Default is `10m`.<br/>The minimum value is `10m`, if the value is less than `10m`, it will be overridden to `10m`. |
| `logging` | -- | -- | The logging options. |
| `logging.dir` | String | `./greptimedb_data/logs` | The directory to store the log files. If set to empty, logs will not be written to files. |
| `logging.level` | String | Unset | The log level. Can be `info`/`debug`/`warn`/`error`. |
| `logging.append_stdout` | Bool | `true` | Whether to append logs to stdout. |
| `logging.log_format` | String | `text` | The log format. Can be `text`/`json`. |
| `logging.max_log_files` | Integer | `720` | The maximum amount of log files. |
| `logging.otlp_export_protocol` | String | `http` | The OTLP tracing export protocol. Can be `grpc`/`http`. |
| `logging.tracing_sample_ratio` | -- | -- | The percentage of tracing will be sampled and exported.<br/>Valid range `[0, 1]`, 1 means all traces are sampled, 0 means all traces are not sampled, the default value is 1.<br/>ratio > 1 are treated as 1. Fractions < 0 are treated as 0 |
| `logging.otlp_headers` | -- | -- | Additional OTLP headers, only valid when using OTLP http |
| `logging.tracing_sample_ratio` | -- | Unset | The percentage of tracing will be sampled and exported.<br/>Valid range `[0, 1]`, 1 means all traces are sampled, 0 means all traces are not sampled, the default value is 1.<br/>ratio > 1 are treated as 1. Fractions < 0 are treated as 0 |
| `export_metrics` | -- | -- | The metasrv can export its metrics and send to Prometheus compatible service (e.g. `greptimedb` itself) from remote-write API.<br/>This is only used for `greptimedb` to export its own metrics internally. It's different from prometheus scrape. |
| `export_metrics.write_interval` | String | `30s` | The interval of export metrics. |
| `export_metrics.remote_write` | -- | -- | -- |
| `export_metrics.remote_write.url` | String | `""` | The prometheus remote write endpoint that the metrics send to. The url example can be: `http://127.0.0.1:4000/v1/prometheus/write?db=greptime_metrics`. |
| `node_id` | Integer | Unset | The datanode identifier; it should be unique in the cluster. |
| `default_column_prefix` | String | Unset | The default column prefix for auto-created time index and value columns. |
| `require_lease_before_startup` | Bool | `false` | Start services after regions have obtained leases.<br/>It will block the datanode start if it can't receive leases in the heartbeat from metasrv. |
| `init_regions_in_background` | Bool | `false` | Initialize all regions in the background during the startup.<br/>By default, it provides services after all regions have been initialized. |
| `max_concurrent_queries` | Integer | `0` | The maximum number of concurrent queries allowed to be executed. Zero means unlimited. |
| `max_concurrent_queries` | Integer | `0` | The maximum number of concurrent queries allowed to be executed. Zero means unlimited.<br/>NOTE: This setting affects scan_memory_limit's privileged tier allocation.<br/>When set, 70% of queries get privileged memory access (full scan_memory_limit).<br/>The remaining 30% get standard tier access (70% of scan_memory_limit). |
| `enable_telemetry` | Bool | `true` | Enable telemetry to collect anonymous usage data. Enabled by default. |
| `http` | -- | -- | The HTTP server options. |
| `http.addr` | String | `127.0.0.1:4000` | The address to bind the HTTP server. |
| `wal.provider` | String | `raft_engine` | The provider of the WAL.<br/>- `raft_engine`: the WAL is stored in the local file system by raft-engine.<br/>- `kafka`: remote WAL whose data is stored in Kafka. |
| `wal.provider` | String | `raft_engine` | The provider of the WAL.<br/>- `raft_engine`: the WAL is stored in the local file system by raft-engine.<br/>- `kafka`: remote WAL whose data is stored in Kafka.<br/>- `noop`: a no-op WAL provider that does not store any WAL data.<br/>**Notes: any unflushed data will be lost when the datanode is shut down.** |
| `wal.dir` | String | Unset | The directory to store the WAL files.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.file_size` | String | `128MB` | The size of the WAL segment file.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.purge_threshold` | String | `1GB` | The threshold of the WAL size to trigger a purge.<br/>**It's only used when the provider is `raft_engine`**. |
@@ -463,6 +485,8 @@
| `wal.sync_period` | String | `10s` | Duration for fsyncing log files.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.broker_endpoints` | Array | -- | The Kafka broker endpoints.<br/>**It's only used when the provider is `kafka`**. |
| `wal.connect_timeout` | String | `3s` | The connect timeout for kafka client.<br/>**It's only used when the provider is `kafka`**. |
| `wal.timeout` | String | `3s` | The timeout for kafka client.<br/>**It's only used when the provider is `kafka`**. |
| `wal.max_batch_bytes` | String | `1MB` | The max size of a single producer batch.<br/>Warning: Kafka has a default limit of 1MB per message in a topic.<br/>**It's only used when the provider is `kafka`**. |
| `wal.consumer_wait_timeout` | String | `100ms` | The consumer wait timeout.<br/>**It's only used when the provider is `kafka`**. |
| `wal.create_index` | Bool | `true` | Whether to enable WAL index creation.<br/>**It's only used when the provider is `kafka`**. |
@@ -470,11 +494,10 @@
| `wal.overwrite_entry_start_id` | Bool | `false` | Ignore missing entries during read WAL.<br/>**It's only used when the provider is `kafka`**.<br/><br/>This option ensures that when Kafka messages are deleted, the system<br/>can still successfully replay memtable data without throwing an<br/>out-of-range error.<br/>However, enabling this option might lead to unexpected data loss,<br/>as the system will skip over missing entries instead of treating<br/>them as critical errors. |
| `query` | -- | -- | The query engine options. |
| `query.parallelism` | Integer | `0` | Parallelism of the query engine.<br/>Default to 0, which means the number of CPU cores. |
| `query.memory_pool_size` | String | `50%` | Memory pool size for query execution operators (aggregation, sorting, join).<br/>Supports absolute size (e.g., "2GB", "4GB") or percentage of system memory (e.g., "20%").<br/>Setting it to 0 disables the limit (unbounded, default behavior).<br/>When this limit is reached, queries will fail with ResourceExhausted error.<br/>NOTE: This does NOT limit memory used by table scans. |
| `storage` | -- | -- | The data storage options. |
| `storage.data_home` | String | `./greptimedb_data` | The working home directory. |
| `storage.type` | String | `File` | The storage type used to store the data.<br/>- `File`: the data is stored in the local file system.<br/>- `S3`: the data is stored in the S3 object storage.<br/>- `Gcs`: the data is stored in the Google Cloud Storage.<br/>- `Azblob`: the data is stored in the Azure Blob Storage.<br/>- `Oss`: the data is stored in the Aliyun OSS. |
| `storage.cache_path` | String | Unset | Read cache configuration for object storage such as 'S3' etc, it's configured by default when using object storage. It is recommended to configure it when using object storage for better performance.<br/>A local file directory, defaults to `{data_home}`. An empty string means disabling. |
| `storage.cache_capacity` | String | Unset | The local file cache capacity in bytes. If your disk space is sufficient, it is recommended to set it larger. |
| `storage.bucket` | String | Unset | The S3 bucket name.<br/>**It's only used when the storage type is `S3`, `Oss` and `Gcs`**. |
| `storage.root` | String | Unset | The S3 data will be stored in the specified prefix, for example, `s3://${bucket}/${root}`.<br/>**It's only used when the storage type is `S3`, `Oss` and `Azblob`**. |
| `storage.access_key_id` | String | Unset | The access key id of the aws account.<br/>It's **highly recommended** to use AWS IAM roles instead of hardcoding the access key id and secret key.<br/>**It's only used when the storage type is `S3` and `Oss`**. |
@@ -501,10 +524,14 @@
| `region_engine.mito.worker_channel_size` | Integer | `128` | Request channel size of each worker. |
| `region_engine.mito.worker_request_batch_size` | Integer | `64` | Max batch size for a worker to handle requests. |
| `region_engine.mito.manifest_checkpoint_distance` | Integer | `10` | Number of meta action updated to trigger a new checkpoint for the manifest. |
| `region_engine.mito.experimental_manifest_keep_removed_file_count` | Integer | `256` | Number of removed files to keep in the manifest's `removed_files` field before they are also<br/>removed from `removed_files`. Mostly for debugging purposes.<br/>If set to 0, only `keep_removed_file_ttl` decides when to remove files<br/>from the `removed_files` field. |
| `region_engine.mito.experimental_manifest_keep_removed_file_ttl` | String | `1h` | How long to keep removed files in the `removed_files` field of the manifest<br/>after they are removed from the manifest.<br/>Files will only be removed from the `removed_files` field<br/>if both `keep_removed_file_count` and `keep_removed_file_ttl` are reached. |
| `region_engine.mito.compress_manifest` | Bool | `false` | Whether to compress manifest and checkpoint file by gzip (default false). |
| `region_engine.mito.max_background_flushes` | Integer | Auto | Max number of running background flush jobs (default: 1/2 of cpu cores). |
| `region_engine.mito.max_background_compactions` | Integer | Auto | Max number of running background compaction jobs (default: 1/4 of cpu cores). |
| `region_engine.mito.max_background_purges` | Integer | Auto | Max number of running background purge jobs (default: number of cpu cores). |
| `region_engine.mito.experimental_compaction_memory_limit` | String | 0 | Memory budget for compaction tasks. Setting it to 0 or "unlimited" disables the limit. |
| `region_engine.mito.experimental_compaction_on_exhausted` | String | wait | Behavior when compaction cannot acquire memory from the budget.<br/>Options: "wait" (default, 10s), "wait(<duration>)", "fail" |
| `region_engine.mito.auto_flush_interval` | String | `1h` | Interval to auto flush a region if it has not flushed yet. |
| `region_engine.mito.global_write_buffer_size` | String | Auto | Global write buffer size for all regions. If not set, it defaults to 1/8 of OS memory with a max limitation of 1GB. |
| `region_engine.mito.global_write_buffer_reject_size` | String | Auto | Global write buffer size threshold to reject write requests. If not set, it defaults to 2 times `global_write_buffer_size`. |
@@ -516,11 +543,17 @@
| `region_engine.mito.write_cache_path` | String | `""` | File system path for write cache, defaults to `{data_home}`. |
| `region_engine.mito.write_cache_size` | String | `5GiB` | Capacity for write cache. If your disk space is sufficient, it is recommended to set it larger. |
| `region_engine.mito.preload_index_cache` | Bool | `true` | Preload index (puffin) files into cache on region open (default: true).<br/>When enabled, index files are loaded into the write cache during region initialization,<br/>which can improve query performance at the cost of longer startup times. |
| `region_engine.mito.index_cache_percent` | Integer | `20` | Percentage of write cache capacity allocated for index (puffin) files (default: 20).<br/>The remaining capacity is used for data (parquet) files.<br/>Must be between 0 and 100 (exclusive). For example, with a 5GiB write cache and 20% allocation,<br/>1GiB is reserved for index files and 4GiB for data files. |
| `region_engine.mito.parallel_scan_channel_size` | Integer | `32` | Capacity of the channel to send data from parallel scan tasks to the main task. |
| `region_engine.mito.max_concurrent_scan_files` | Integer | `128` | Maximum number of SST files to scan concurrently. |
| `region_engine.mito.max_concurrent_scan_files` | Integer | `384` | Maximum number of SST files to scan concurrently. |
| `region_engine.mito.allow_stale_entries` | Bool | `false` | Whether to allow stale WAL entries read during replay. |
| `region_engine.mito.scan_memory_limit` | String | `50%` | Memory limit for table scans across all queries.<br/>Supports absolute size (e.g., "2GB") or percentage of system memory (e.g., "20%").<br/>Setting it to 0 disables the limit.<br/>NOTE: Works with max_concurrent_queries for tiered memory allocation.<br/>- If max_concurrent_queries is set: 70% of queries get full access, 30% get 70% access.<br/>- If max_concurrent_queries is 0 (unlimited): first 20 queries get full access, rest get 70% access. |
| `region_engine.mito.min_compaction_interval` | String | `0m` | Minimum time interval between two compactions.<br/>To align with the old behavior, the default value is 0 (no restrictions). |
| `region_engine.mito.default_experimental_flat_format` | Bool | `false` | Whether to enable experimental flat format as the default format. |
| `region_engine.mito.index` | -- | -- | The options for index in Mito engine. |
| `region_engine.mito.index.aux_path` | String | `""` | Auxiliary directory path for the index in filesystem, used to store intermediate files for<br/>creating the index and staging files for searching the index, defaults to `{data_home}/index_intermediate`.<br/>The default name for this directory is `index_intermediate` for backward compatibility.<br/><br/>This path contains two subdirectories:<br/>- `__intm`: for storing intermediate files used during creating index.<br/>- `staging`: for storing staging files used during searching index. |
| `region_engine.mito.index.staging_size` | String | `2GB` | The max capacity of the staging directory. |
@@ -552,24 +585,19 @@
| `region_engine.mito.memtable.fork_dictionary_bytes` | String | `1GiB` | Max dictionary bytes.<br/>Only available for `partition_tree` memtable. |
| `logging.append_stdout` | Bool | `true` | Whether to append logs to stdout. |
| `logging.log_format` | String | `text` | The log format. Can be `text`/`json`. |
| `logging.max_log_files` | Integer | `720` | The maximum amount of log files. |
| `logging.otlp_export_protocol` | String | `http` | The OTLP tracing export protocol. Can be `grpc`/`http`. |
| `logging.tracing_sample_ratio` | -- | -- | The percentage of tracing will be sampled and exported.<br/>Valid range `[0, 1]`, 1 means all traces are sampled, 0 means all traces are not sampled, the default value is 1.<br/>ratio > 1 are treated as 1. Fractions < 0 are treated as 0 |
| `logging.otlp_headers` | -- | -- | Additional OTLP headers, only valid when using OTLP http |
| `logging.tracing_sample_ratio` | -- | Unset | The percentage of tracing will be sampled and exported.<br/>Valid range `[0, 1]`, 1 means all traces are sampled, 0 means all traces are not sampled, the default value is 1.<br/>ratio > 1 are treated as 1. Fractions < 0 are treated as 0 |
| `export_metrics` | -- | -- | The datanode can export its metrics and send to Prometheus compatible service (e.g. `greptimedb` itself) from remote-write API.<br/>This is only used for `greptimedb` to export its own metrics internally. It's different from prometheus scrape. |
| `export_metrics.write_interval` | String | `30s` | The interval of export metrics. |
| `export_metrics.remote_write` | -- | -- | -- |
| `export_metrics.remote_write.url` | String | `""` | The prometheus remote write endpoint that the metrics send to. The url example can be: `http://127.0.0.1:4000/v1/prometheus/write?db=greptime_metrics`. |
| `logging.append_stdout` | Bool | `true` | Whether to append logs to stdout. |
| `logging.log_format` | String | `text` | The log format. Can be `text`/`json`. |
| `logging.max_log_files` | Integer | `720` | The maximum amount of log files. |
| `logging.otlp_export_protocol` | String | `http` | The OTLP tracing export protocol. Can be `grpc`/`http`. |
| `logging.tracing_sample_ratio` | -- | -- | The percentage of tracing will be sampled and exported.<br/>Valid range `[0, 1]`, 1 means all traces are sampled, 0 means all traces are not sampled, the default value is 1.<br/>ratio > 1 are treated as 1. Fractions < 0 are treated as 0 |
| `logging.otlp_headers` | -- | -- | Additional OTLP headers, only valid when using OTLP http |
| `logging.tracing_sample_ratio` | -- | Unset | The percentage of tracing will be sampled and exported.<br/>Valid range `[0, 1]`, 1 means all traces are sampled, 0 means all traces are not sampled, the default value is 1.<br/>ratio > 1 are treated as 1. Fractions < 0 are treated as 0 |
| `query.parallelism` | Integer | `1` | Parallelism of the query engine for query sent by flownode.<br/>Default to 1, so it won't use too much cpu or memory |
| `query.memory_pool_size` | String | `50%` | Memory pool size for query execution operators (aggregation, sorting, join).<br/>Supports absolute size (e.g., "1GB", "2GB") or percentage of system memory (e.g., "20%").<br/>Setting it to 0 disables the limit (unbounded, default behavior).<br/>When this limit is reached, queries will fail with ResourceExhausted error.<br/>NOTE: This does NOT limit memory used by table scans. |
| `memory` | -- | -- | The memory options. |
| `memory.enable_heap_profiling` | Bool | `true` | Whether to enable heap profiling activation during startup.<br/>When enabled, heap profiling will be activated if the `MALLOC_CONF` environment variable<br/>is set to "prof:true,prof_active:false". The official image adds this env variable.<br/>Default is true. |
## Read cache configuration for object storage such as 'S3' etc, it's configured by default when using object storage. It is recommended to configure it when using object storage for better performance.
## A local file directory, defaults to `{data_home}`. An empty string means disabling.
## @toml2docs:none-default
#+ cache_path = ""
## The local file cache capacity in bytes. If your disk space is sufficient, it is recommended to set it larger.
## @toml2docs:none-default
cache_capacity = "5GiB"
## The S3 bucket name.
## **It's only used when the storage type is `S3`, `Oss` and `Gcs`**.
## Whether to enable the experimental sparse primary key encoding.
experimental_sparse_primary_key_encoding = false
## Whether to use sparse primary key encoding.
sparse_primary_key_encoding = true
## The logging options.
[logging]
@@ -632,7 +695,7 @@ level = "info"
enable_otlp_tracing = false
## The OTLP tracing endpoint.
otlp_endpoint = "http://localhost:4318"
otlp_endpoint = "http://localhost:4318/v1/traces"
## Whether to append logs to stdout.
append_stdout = true
@@ -646,27 +709,19 @@ max_log_files = 720
## The OTLP tracing export protocol. Can be `grpc`/`http`.
otlp_export_protocol = "http"
## Additional OTLP headers, only valid when using OTLP http
[logging.otlp_headers]
## @toml2docs:none-default
#Authorization = "Bearer my-token"
## @toml2docs:none-default
#Database = "My database"
## The percentage of tracing will be sampled and exported.
## Valid range `[0, 1]`, 1 means all traces are sampled, 0 means all traces are not sampled, the default value is 1.
## ratio > 1 are treated as 1. Fractions < 0 are treated as 0
[logging.tracing_sample_ratio]
default_ratio = 1.0
## The datanode can export its metrics and send to Prometheus compatible service (e.g. `greptimedb` itself) from remote-write API.
## This is only used for `greptimedb` to export its own metrics internally. It's different from prometheus scrape.
[export_metrics]
## whether enable export metrics.
enable = false
## The interval of export metrics.
write_interval = "30s"
[export_metrics.remote_write]
## The prometheus remote write endpoint that the metrics send to. The url example can be: `http://127.0.0.1:4000/v1/prometheus/write?db=greptime_metrics`.
url = ""
## HTTP headers of Prometheus remote-write carry.
headers = {}
## The tracing options. Only takes effect when compiled with the `tokio-console` feature.
## Defaults to false, meaning that when push-down optimization fails, an error message is returned.
allow_query_fallback = false
## Memory pool size for query execution operators (aggregation, sorting, join).
## Supports absolute size (e.g., "4GB", "8GB") or percentage of system memory (e.g., "30%").
## Setting it to 0 disables the limit (unbounded, default behavior).
## When this limit is reached, queries will fail with ResourceExhausted error.
## NOTE: This does NOT limit memory used by table scans (only applies to datanodes).
memory_pool_size = "50%"
## Datanode options.
[datanode]
## Datanode client options.
@@ -221,7 +279,7 @@ level = "info"
enable_otlp_tracing = false
## The OTLP tracing endpoint.
otlp_endpoint = "http://localhost:4318"
otlp_endpoint = "http://localhost:4318/v1/traces"
## Whether to append logs to stdout.
append_stdout = true
@@ -235,6 +293,13 @@ max_log_files = 720
## The OTLP tracing export protocol. Can be `grpc`/`http`.
otlp_export_protocol = "http"
## Additional OTLP headers, only valid when using OTLP http
[logging.otlp_headers]
## @toml2docs:none-default
#Authorization = "Bearer my-token"
## @toml2docs:none-default
#Database = "My database"
## The percentage of tracing will be sampled and exported.
## Valid range `[0, 1]`, 1 means all traces are sampled, 0 means all traces are not sampled, the default value is 1.
## ratio > 1 are treated as 1. Fractions < 0 are treated as 0
@@ -260,21 +325,6 @@ sample_ratio = 1.0
## The TTL of the `slow_queries` system table. Default is `90d` when `record_type` is `system_table`.
ttl = "90d"
## The frontend can export its metrics and send to Prometheus compatible service (e.g. `greptimedb` itself) from remote-write API.
## This is only used for `greptimedb` to export its own metrics internally. It's different from prometheus scrape.
[export_metrics]
## whether enable export metrics.
enable = false
## The interval of export metrics.
write_interval = "30s"
[export_metrics.remote_write]
## The prometheus remote write endpoint that the metrics send to. The url example can be: `http://127.0.0.1:4000/v1/prometheus/write?db=greptime_metrics`.
url = ""
## HTTP headers of Prometheus remote-write carry.
headers = {}
## The tracing options. Only takes effect when compiled with the `tokio-console` feature.
## **It's only used when the provider is `kafka`**.
topic_name_prefix = "greptimedb_wal_topic"
## Expected number of replicas of each partition.
## **It's only used when the provider is `kafka`**.
replication_factor = 1
## The timeout above which a topic creation operation will be cancelled.
## The timeout for creating a Kafka topic.
## **It's only used when the provider is `kafka`**.
create_topic_timeout = "30s"
# The Kafka SASL configuration.
@@ -245,6 +299,18 @@ create_topic_timeout = "30s"
## TTL for the events table that will be used to store the events. Default is `90d`.
ttl = "90d"
## Configuration options for the stats persistence.
[stats_persistence]
## TTL for the stats table that will be used to store the stats.
## Set to `0s` to disable stats persistence.
## Default is `0s`.
## If you want to enable stats persistence, set the TTL to a value greater than 0.
## It is recommended to set a small value, e.g., `3h`.
ttl = "0s"
## The interval to persist the stats. Default is `10m`.
## The minimum value is `10m`, if the value is less than `10m`, it will be overridden to `10m`.
interval = "10m"
## The logging options.
[logging]
## The directory to store the log files. If set to empty, logs will not be written to files.
@@ -258,7 +324,7 @@ level = "info"
enable_otlp_tracing = false
## The OTLP tracing endpoint.
otlp_endpoint = "http://localhost:4318"
otlp_endpoint = "http://localhost:4318/v1/traces"
## Whether to append logs to stdout.
append_stdout = true
@@ -272,27 +338,20 @@ max_log_files = 720
## The OTLP tracing export protocol. Can be `grpc`/`http`.
otlp_export_protocol = "http"
## Additional OTLP headers, only valid when using OTLP http
[logging.otlp_headers]
## @toml2docs:none-default
#Authorization = "Bearer my-token"
## @toml2docs:none-default
#Database = "My database"
## The percentage of tracing will be sampled and exported.
## Valid range `[0, 1]`, 1 means all traces are sampled, 0 means all traces are not sampled, the default value is 1.
## ratio > 1 are treated as 1. Fractions < 0 are treated as 0
[logging.tracing_sample_ratio]
default_ratio = 1.0
## The metasrv can export its metrics and send to Prometheus compatible service (e.g. `greptimedb` itself) from remote-write API.
## This is only used for `greptimedb` to export its own metrics internally. It's different from prometheus scrape.
[export_metrics]
## whether enable export metrics.
enable = false
## The interval of export metrics.
write_interval = "30s"
[export_metrics.remote_write]
## The prometheus remote write endpoint that the metrics send to. The url example can be: `http://127.0.0.1:4000/v1/prometheus/write?db=greptime_metrics`.
url = ""
## HTTP headers of Prometheus remote-write carry.
headers = {}
## The tracing options. Only takes effect when compiled with the `tokio-console` feature.
## Read cache configuration for object storage such as 'S3' etc, it's configured by default when using object storage. It is recommended to configure it when using object storage for better performance.
## A local file directory, defaults to `{data_home}`. An empty string means disabling.
## @toml2docs:none-default
#+ cache_path = ""
## The local file cache capacity in bytes. If your disk space is sufficient, it is recommended to set it larger.
## @toml2docs:none-default
cache_capacity = "5GiB"
## The S3 bucket name.
## **It's only used when the storage type is `S3`, `Oss` and `Gcs`**.
## @toml2docs:none-default
@@ -515,6 +543,15 @@ compress_manifest = false
## @toml2docs:none-default="Auto"
#+ max_background_purges = 8
## Memory budget for compaction tasks. Setting it to 0 or "unlimited" disables the limit.
## @toml2docs:none-default="0"
#+ experimental_compaction_memory_limit = "0"
## Behavior when compaction cannot acquire memory from the budget.
## Whether to enable the experimental sparse primary key encoding.
experimental_sparse_primary_key_encoding = false
## Whether to use sparse primary key encoding.
sparse_primary_key_encoding = true
## The logging options.
[logging]
@@ -723,7 +789,7 @@ level = "info"
enable_otlp_tracing = false
## The OTLP tracing endpoint.
otlp_endpoint = "http://localhost:4318"
otlp_endpoint = "http://localhost:4318/v1/traces"
## Whether to append logs to stdout.
append_stdout = true
@@ -737,6 +803,13 @@ max_log_files = 720
## The OTLP tracing export protocol. Can be `grpc`/`http`.
otlp_export_protocol = "http"
## Additional OTLP headers, only valid when using OTLP http
[logging.otlp_headers]
## @toml2docs:none-default
#Authorization = "Bearer my-token"
## @toml2docs:none-default
#Database = "My database"
## The percentage of tracing will be sampled and exported.
## Valid range `[0, 1]`, 1 means all traces are sampled, 0 means all traces are not sampled, the default value is 1.
## ratio > 1 are treated as 1. Fractions < 0 are treated as 0
@@ -760,27 +833,6 @@ default_ratio = 1.0
## @toml2docs:none-default
#+ sample_ratio = 1.0
## The standalone can export its metrics and send to Prometheus compatible service (e.g. `greptimedb`) from remote-write API.
## This is only used for `greptimedb` to export its own metrics internally. It's different from prometheus scrape.
[export_metrics]
## whether enable export metrics.
enable = false
## The interval of export metrics.
write_interval = "30s"
## For `standalone` mode, `self_import` is recommended to collect metrics generated by itself
## You must create the database before enabling it.
[export_metrics.self_import]
## @toml2docs:none-default
db = "greptime_metrics"
[export_metrics.remote_write]
## The prometheus remote write endpoint that the metrics send to. The url example can be: `http://127.0.0.1:4000/v1/prometheus/write?db=greptime_metrics`.
url = ""
## HTTP headers of Prometheus remote-write carry.
headers = {}
## The tracing options. Only takes effect when compiled with the `tokio-console` feature.
@@ -13,4 +13,19 @@ Log Level changed from Some("info") to "trace,flow=debug"%
The data is a string in the format of `global_level,module1=level1,module2=level2,...` that follows the same rule of `RUST_LOG`.
The module is the module name of the log, and the level is the log level. The log level can be one of the following: `trace`, `debug`, `info`, `warn`, `error`, `off`(case insensitive).
You can control heap profiling activation through configuration. Add the following to your configuration file:
```toml
[memory]
# Whether to enable heap profiling activation during startup.
# When enabled, heap profiling will be activated if the `MALLOC_CONF` environment variable
# is set to "prof:true,prof_active:false". The official image adds this env variable.
# Default is true.
enable_heap_profiling = true
```
By default, if you set `MALLOC_CONF=prof:true,prof_active:false`, the database will enable profiling during startup. You can disable this behavior by setting `enable_heap_profiling = false` in the configuration.
### Starting with environment variables
### Enable memory profiling for greptimedb binary
Start GreptimeDB instance with environment variables:
Currently, our query engine is based on DataFusion, so all aggregate functions are executed by DataFusion through its UDAF interface. You can find DataFusion's UDAF example [here](https://github.com/apache/datafusion/tree/main/datafusion-examples/examples/simple_udaf.rs). Basically, we provide the same way as DataFusion to write aggregate functions: both are centered on a struct called "Accumulator" that accumulates states along the way during aggregation.
However, DataFusion's UDAF implementation has a huge restriction: it requires the user to provide a concrete "Accumulator". Take the `Median` aggregate function for example: to aggregate a `u32` column, you have to write a `MedianU32` and use `SELECT MEDIANU32(x)` in SQL. `MedianU32` cannot be used to aggregate an `i32` column. Alternatively, you can use a special type that can hold all kinds of data (like our `Value` enum or Arrow's `ScalarValue`) and `match` your way through the aggregate calculations. It might work, though it is rather tedious. (But I think it's DataFusion's preferred way to write a UDAF.)
So is there a way to make an aggregate function that automatically matches the input data's type? For example, a `Median` aggregator that works on both `u32` and `i32` columns? The answer is yes, once we find a way to bypass DataFusion's restriction: DataFusion simply doesn't pass the input data's type when creating an Accumulator.
> There's an example in `my_sum_udaf_example.rs`, take that as quick start.
# 1. Impl `AggregateFunctionCreator` trait for your accumulator creator.
You must first define a struct that will be used to create your accumulator. For example,
```Rust
#[as_aggr_func_creator]
#[derive(Debug, AggrFuncTypeStore)]
struct MySumAccumulatorCreator {}
```
Attribute macro `#[as_aggr_func_creator]` and derive macro `#[derive(Debug, AggrFuncTypeStore)]` must both be annotated on the struct. They work together to provide a storage of aggregate function's input data types, which are needed for creating generic accumulator later.
> Note that the `as_aggr_func_creator` macro will add fields to the struct, so the struct cannot be defined as an empty struct without fields like `struct Foo;`, nor as a newtype like `struct Foo(bar)`.
Then impl `AggregateFunctionCreator` trait on it. The definition of the trait is:
You can use input data's type in methods that return output type and state types (just invoke `input_types()`).
The output type is aggregate function's output data's type. For example, `SUM` aggregate function's output type is `u64` for a `u32` datatype column. The state types are accumulator's internal states' types. Take `AVG` aggregate function on a `i32` column as example, its state types are `i64` (for sum) and `u64` (for count).
The `creator` function is where you define how an accumulator (that will be used in DataFusion) is created. You define "how" to create the accumulator (instead of "what" to create), using the input data's type as arguments. With input datatype known, you can create accumulator generically.
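To make the idea concrete, below is a small, self-contained sketch of the pattern: a single creation function receives the input type and produces a matching accumulator, instead of a hardcoded per-type struct. The `DataType` enum, `Accumulator` trait, and `create_accumulator` function here are made up for illustration; they are not GreptimeDB's or DataFusion's actual types.
```Rust
// Toy sketch only: `DataType`, `Accumulator`, and `create_accumulator`
// are hypothetical stand-ins, not real GreptimeDB or DataFusion APIs.
#[derive(Clone, Copy, Debug)]
enum DataType {
    UInt32,
    Int32,
}

trait Accumulator {
    fn update(&mut self, value: i64);
    fn evaluate(&self) -> i64;
}

struct SumAccumulator {
    sum: i64,
}

impl Accumulator for SumAccumulator {
    fn update(&mut self, value: i64) {
        self.sum += value;
    }
    fn evaluate(&self) -> i64 {
        self.sum
    }
}

// "How" to create an accumulator: the input type is an argument, so one
// creator can serve both u32 and i32 columns instead of a per-type struct.
fn create_accumulator(input_type: DataType) -> Box<dyn Accumulator> {
    match input_type {
        DataType::UInt32 | DataType::Int32 => Box::new(SumAccumulator { sum: 0 }),
    }
}

fn main() {
    let mut acc = create_accumulator(DataType::Int32);
    for v in [1, 2, 3] {
        acc.update(v);
    }
    assert_eq!(acc.evaluate(), 6);
}
```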
# 2. Impl `Accumulator` trait for your accumulator.
The accumulator is where you store the aggregate calculation states and evaluate a result. You must impl `Accumulator` trait for it. The trait's definition is:
The DataFusion basically executes aggregate like this:
1. Partitioning all input data for aggregate. Create an accumulator for each part.
2. Call `update_batch` on each accumulator with partitioned data, to let you update your aggregate calculation.
3. Call `state` to get each accumulator's internal state, the medial calculation result.
4. Call `merge_batch` to merge all accumulator's internal state to one.
5. Execute `evaluate` on the chosen one to get the final calculation result.
Once you know the meaning of each method, you can easily write your accumulator. You can refer to `Median` accumulator or `SUM` accumulator defined in file `my_sum_udaf_example.rs` for more details.
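As a complement to the description above, here is a tiny self-contained mirror of that flow. `ToySum` is not the real `Accumulator` trait; it only shows how per-partition accumulators are updated, how their states are merged, and where the final result is evaluated.
```Rust
// Toy mirror of the execution flow; not the real Accumulator trait.
#[derive(Default)]
struct ToySum {
    sum: i64,
}

impl ToySum {
    // Step 2: update with one partition's data.
    fn update_batch(&mut self, values: &[i64]) {
        self.sum += values.iter().sum::<i64>();
    }
    // Step 3: expose the internal state (here just the running sum).
    fn state(&self) -> i64 {
        self.sum
    }
    // Step 4: merge other accumulators' states into this one.
    fn merge_batch(&mut self, states: &[i64]) {
        self.sum += states.iter().sum::<i64>();
    }
    // Step 5: produce the final result.
    fn evaluate(&self) -> i64 {
        self.sum
    }
}

fn main() {
    // Step 1: one accumulator per data partition.
    let partitions = vec![vec![1, 2], vec![3, 4], vec![5]];
    let mut accs: Vec<ToySum> = partitions.iter().map(|_| ToySum::default()).collect();
    for (acc, part) in accs.iter_mut().zip(&partitions) {
        acc.update_batch(part);
    }
    // Steps 3-4: collect states and merge them into one accumulator.
    let states: Vec<i64> = accs.iter().map(|a| a.state()).collect();
    let mut merged = ToySum::default();
    merged.merge_batch(&states);
    assert_eq!(merged.evaluate(), 15);
}
```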
# 3. Register your aggregate function to our query engine.
You can call the `register_aggregate_function` method in the query engine to register your aggregate function. To do that, you have to create an instance of the struct `AggregateFunctionMeta`. The struct has three fields. The first is your aggregate function's name. The function name is case-sensitive due to DataFusion's restriction. We strongly recommend using lowercase for your name. If you have to use an uppercase name, wrap your aggregate function in quotation marks. For example, if you define an aggregate function named "my_aggr", you can use "`SELECT MY_AGGR(x)`"; if you define "my_AGGR", you have to use "`SELECT "my_AGGR"(x)`".
The second field is `arg_counts`, the count of the arguments. For example, the `percentile` accumulator calculates the p-th percentile of a column; it needs both the column values and the value of `p` as input, so its argument count is two.
The third field is a function that creates the accumulator creator you defined in step 1 above. Creating a creator is a bit intertwined, but it is how we make DataFusion use a newly created aggregate function each time it executes a SQL statement, preventing the stored input types from affecting each other. A good starting point for the details is our `DfContextProviderAdapter` struct's `get_aggregate_meta` method.
# (Optional) 4. Make your aggregate function automatically registered.
If you've written a great aggregate function that you want to let everyone use, you can make it automatically register to our query engine at start time. It's quick and simple, just refer to the `AggregateFunctions::register` function in `common/function/src/scalars/aggregate/mod.rs`.
@@ -106,6 +106,37 @@ This mechanism may be too complex to implement at once. We can consider a two-ph
Also, the read replica shouldn't lag behind in manifest version by more than the lingering time of obsolete files, otherwise it might reference files that are already deleted by the GC worker.
- need to upload tmp manifest to object storage, which may introduce additional complexity and potential performance overhead. But since long-running queries are typically not frequent, the performance impact is expected to be minimal.
one potential race condition with region-migration is illustrated below:
```mermaid
sequenceDiagram
participant gc_worker as GC Worker(same dn as region 1)
participant region1 as Region 1 (Leader → Follower)
participant region2 as Region 2 (Follower → Leader)
participant region_dir as Region Directory
gc_worker->>region1: Start GC, get region manifest
activate region1
region1-->>gc_worker: Region 1 manifest
deactivate region1
gc_worker->>region_dir: Scan region directory
Note over region1,region2: Region Migration Occurs
region1-->>region2: Downgrade to Follower
region2-->>region1: Becomes Leader
region2->>region_dir: Add new file
gc_worker->>region_dir: Continue scanning
gc_worker-->>region_dir: Discovers new file
Note over gc_worker: New file not in Region 1's manifest
gc_worker->>gc_worker: Mark file as orphan (incorrectly)
```
which could cause the GC worker to incorrectly mark the new file as orphan and delete it, if the configured lingering time for orphan files (files not mentioned anywhere, whether in used or unused sets) is not long enough.
A good enough solution could be to use a lock to prevent the GC worker from running on a region while region migration is happening on that region, and vice versa.
The race condition between the GC worker and repartition also needs to be considered carefully. For now, acquiring a lock for both region-migration and repartition during the GC worker process could be a simple solution.
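As a rough illustration of the locking idea (not the actual GreptimeDB implementation; the names below are hypothetical), both the GC worker and region migration could take the same per-region mutex so their critical sections never interleave:
```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

type RegionId = u64;

// Hypothetical per-region locks shared by the GC and migration paths.
#[derive(Default, Clone)]
struct RegionLocks {
    locks: Arc<Mutex<HashMap<RegionId, Arc<Mutex<()>>>>>,
}

impl RegionLocks {
    fn lock_for(&self, region: RegionId) -> Arc<Mutex<()>> {
        let mut map = self.locks.lock().unwrap();
        map.entry(region).or_insert_with(|| Arc::new(Mutex::new(()))).clone()
    }
}

fn gc_region(locks: &RegionLocks, region: RegionId) {
    // Holding the lock prevents a concurrent migration from adding files
    // that the GC scan would otherwise misclassify as orphans.
    let lock = locks.lock_for(region);
    let _guard = lock.lock().unwrap();
    // ... read manifest, scan region directory, delete orphans ...
}

fn migrate_region(locks: &RegionLocks, region: RegionId) {
    let lock = locks.lock_for(region);
    let _guard = lock.lock().unwrap();
    // ... downgrade leader, promote follower, write new files ...
}

fn main() {
    let locks = RegionLocks::default();
    gc_region(&locks, 1);
    migrate_region(&locks, 1);
}
```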
This RFC proposes an asynchronous index build mechanism in the database, with a configuration option to choose between synchronous and asynchronous modes, aiming to improve flexibility and adapt to different workload requirements.
# Motivation
Currently, index creation is performed synchronously, which may lead to prolonged write suspension and impact business continuity. As data volume grows, the time required for index building increases significantly. An asynchronous solution is urgently needed to enhance user experience and system throughput.
# Details
## Overview
The following table highlights the difference between async and sync index approach:
| Approach | Trigger | Data Source | Additional Index Metadata Installation | Fine-grained `FileMeta` Index |
| :--- | :--- | :--- | :--- | :--- |
| Sync Index | On `write_sst` | Memory (on flush) / Disk (on compact) | Not required (already installed synchronously) | Not required |
| Async Index | 4 trigger types | Disk | Required | Required |
The index build mode (synchronous or asynchronous) can be selected via configuration file.
### Four Trigger Types
This RFC introduces four `IndexBuildType`s to trigger index building (see the enum sketch after this list):
- **Manual Rebuild**: Triggered by the user via `ADMIN build_index("table_name")`, for scenarios like recovering from failed builds or migrating data. SST files whose `ColumnIndexMetadata` (see below) is already consistent with the `RegionMetadata` will be skipped.
- **Schema Change**: Automatically triggered when the schema of an indexed column is altered.
- **Flush**: Automatically builds indexes for new SST files created by a flush.
- **Compact**: Automatically builds indexes for new SST files created by a compaction.
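A minimal sketch of how these four triggers might be modeled as an enum; the exact variant names in the implementation may differ, so treat this as illustrative only.
```rust
// Illustrative sketch of the four trigger types described above;
// the actual variant names in the codebase may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum IndexBuildType {
    /// Triggered by the user via `ADMIN build_index("table_name")`.
    ManualRebuild,
    /// Triggered when the schema of an indexed column is altered.
    SchemaChange,
    /// Builds indexes for new SST files created by a flush.
    Flush,
    /// Builds indexes for new SST files created by a compaction.
    Compact,
}

fn main() {
    let trigger = IndexBuildType::Flush;
    println!("index build trigger: {trigger:?}");
}
```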
### Additional Index Metadata Installation
Previously, index information in the in-memory `FileMeta` was updated synchronously. The async approach requires an explicit installation step.
A race condition can occur when compaction and index building run concurrently, leading to:
1. Building an index for a file that is about to be deleted by compaction.
2. Creating an unnecessary index file and an incorrect manifest record.
3. On restart, replaying the manifest could load metadata for a non-existent file.
To prevent this, the system checks if a file's `FileMeta` is in a `compacting` state before updating the manifest. If it is, the installation is aborted.
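The guard can be sketched as follows. This is a simplified stand-in that assumes hypothetical `FileMeta`/`IndexBuildOutput` shapes and an `update_manifest` callback, rather than the real mito2 types:
```rust
use std::collections::HashMap;

type FileId = u64;

// Simplified stand-ins for illustration only.
struct FileMeta {
    compacting: bool,
}

struct IndexBuildOutput {
    file_id: FileId,
}

/// Install index metadata only if the target file still exists in the
/// current version and is not being compacted; otherwise abort.
fn try_install_index(
    files: &HashMap<FileId, FileMeta>,
    output: &IndexBuildOutput,
    mut update_manifest: impl FnMut(FileId),
) -> bool {
    match files.get(&output.file_id) {
        Some(meta) if !meta.compacting => {
            update_manifest(output.file_id);
            true
        }
        // File is gone or about to be replaced by compaction: skip install.
        _ => false,
    }
}

fn main() {
    let mut files = HashMap::new();
    files.insert(1, FileMeta { compacting: false });
    files.insert(2, FileMeta { compacting: true });

    assert!(try_install_index(&files, &IndexBuildOutput { file_id: 1 }, |_| {}));
    assert!(!try_install_index(&files, &IndexBuildOutput { file_id: 2 }, |_| {}));
    assert!(!try_install_index(&files, &IndexBuildOutput { file_id: 3 }, |_| {}));
}
```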
### Fine-grained `FileMeta` Index
The original `FileMeta` only stored file-level index information. However, manual rebuilds require column-level details to identify files inconsistent with the current DDL. Therefore, the `indexes` field in `FileMeta` is updated as follows:
```rust
struct FileMeta {
    ...
    // From file-level:
    // available_indexes: SmallVec<[IndexType; 4]>
    // To column-level:
    indexes: Vec<ColumnIndexMetadata>,
    ...
}

pub struct ColumnIndexMetadata {
    pub column_id: ColumnId,
    pub created_indexes: IndexTypes,
}
```
## Process
The index building process is similar to a flush and is illustrated below:
```mermaid
sequenceDiagram
Region0->>Region0: Triggered by one of 4 conditions, targets specific files
loop For each target file
Region0->>IndexBuildScheduler: Submits an index build task
end
IndexBuildScheduler->>IndexBuildTask: Executes the task
IndexBuildTask->>Storage Interfaces: Reads SST data from disk
IndexBuildTask->>IndexBuildTask: Builds the index file
Region0->>Storage Interfaces: Updates manifest and Version
```
### Task Triggering and Scheduling
The process starts with one of the four `IndexBuildType` triggers. In `handle_rebuild_index`, the `RegionWorkerLoop` identifies target SSTs from the request or the current region version. It then creates an `IndexBuildTask` for each file and submits it to the `index_build_scheduler`.
Similar to Flush and Compact operations, index build tasks are ultimately dispatched to the LocalScheduler. Resource usage can be adjusted via configuration files. Since asynchronous index tasks are both memory-intensive and IO-intensive but have lower priority, it is recommended to allocate fewer resources to them compared to compaction and flush tasks—for example, limiting them to 1/8 of the CPU cores.
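A rough sketch of the per-file task submission loop described above, using a plain channel as a stand-in for the `index_build_scheduler` (all names here are hypothetical):
```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for the task described in the text.
#[derive(Debug)]
struct IndexBuildTask {
    file_id: u64,
}

fn main() {
    let (tx, rx) = mpsc::channel::<IndexBuildTask>();

    // Worker playing the role of the scheduler: executes tasks one by one.
    let worker = thread::spawn(move || {
        for task in rx {
            // In the real flow this reads the SST, builds the index file,
            // and sends an IndexBuildFinished notification back.
            println!("building index for file {}", task.file_id);
        }
    });

    // The region worker loop: one task per target SST file.
    for file_id in [10, 11, 12] {
        tx.send(IndexBuildTask { file_id }).unwrap();
    }
    drop(tx); // Close the channel so the worker exits after draining tasks.
    worker.join().unwrap();
}
```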
### Index Building and Notification
The scheduled `IndexBuildTask` executes its `index_build` method. It uses an `indexer_builder` to create an `Indexer` that reads SST data and builds the index. If a new index file is created (`IndexOutput.file_size > 0`), the task sends an `IndexBuildFinished` notification back to the `RegionWorkerLoop`.
### Index Metadata Installation
Upon receiving the `IndexBuildFinished` notification in `handle_index_build_finished`, the `RegionWorkerLoop` verifies that the file still exists in the current `version` and is not being compacted. If the check passes, it calls `manifest_ctx.update_manifest` to apply a `RegionEdit` with the new index information, completing the installation.
# Drawbacks
Asynchronous index building may consume extra system resources, potentially affecting overall performance during peak periods.
There may be a delay before the new index becomes available for queries, which could impact certain use cases.
# Unresolved Questions and Future Work
**Resource Management and Throttling**: The resource consumption (CPU, I/O) of background index building can be managed and limited to some extent by configuring a dedicated background thread pool. However, this approach cannot fully eliminate resource contention, especially under heavy workloads or when I/O is highly competitive. Additional throttling mechanisms or dynamic prioritization may still be necessary to avoid impacting foreground operations.
# Alternatives
Instead of being triggered by events like Flush or Compact, index building could be performed in batches during scheduled maintenance windows. This offers predictable resource usage but delays index availability.
This RFC proposes a redesign of the flow architecture where flownode becomes a lightweight in-memory state management node with an embedded frontend for direct computation. This approach optimizes resource utilization and improves scalability by eliminating network hops while maintaining clear separation between coordination and computation tasks.
## Motivation
The current flow architecture has several limitations:
1. **Resource Inefficiency**: Flownodes perform both state management and computation, leading to resource duplication and inefficient utilization.
2. **Scalability Constraints**: Computation resources are tied to flownode instances, limiting horizontal scaling capabilities.
3. **State Management Complexity**: Mixing computation with state management makes the system harder to maintain and debug.
4. **Network Overhead**: Additional network hops between flownode and separate frontend nodes add latency.
The laminar Flow architecture addresses these issues by:
- Consolidating computation within flownode through embedded frontend
- Eliminating network overhead by removing separate frontend node communication
- Simplifying state management by focusing flownode on its core responsibility
- Improving system scalability and maintainability
## Details
### Architecture Overview
The laminar Flow architecture transforms flownode into a lightweight coordinator that maintains flow state with an embedded frontend for computation. The key components involved are:
1. **Flownode**: Maintains in-memory state, coordinates computation, and includes an embedded frontend for query execution
2. **Embedded Frontend**: Executes **incremental** computations within the flownode
3. **Datanode**: Stores final results and source data
The datanode processes these parameters to return only the data within the specified sequence ranges, ensuring efficient incremental processing.
#### Sequence Invalidation and Refill Mechanism
A critical challenge occurs when data referenced by `memtable_last_seq` gets flushed from memory to disk. Since SST files only maintain a single maximum sequence number for the entire file (rather than per-record sequence tracking), precise incremental queries become impossible for the affected time ranges.
**Detection of Invalidation:**
```rust
// When memtable_last_seq data has been flushed to SST
if memtable_last_seq_flushed_to_disk {
    // Incremental query is no longer feasible
    // Need to trigger refill for affected time ranges
}
```
**Refill Process:**
1. **Identify Affected Time Range**: Query the time range corresponding to the flushed `memtable_last_seq` data
2. **Full Recomputation**: Execute a complete aggregation query for the affected time windows
3. **State Replacement**: Replace the existing flow state for these time ranges with newly computed values
4. **Sequence Update**: Update `memtable_last_seq` to the current latest sequence, while `sst_last_seq` continues normal incremental updates
```sql
-- Refill query when memtable data has been flushed
SELECT
    __aggr_state(aggregation_functions) AS state,
    time_window,
    group_keys
FROM source_table
WHERE
    timestamp >= :affected_time_start
    AND timestamp < :affected_time_end
-- Full scan required since sequence precision is lost in SST
GROUP BY time_window, group_keys;
```
#### Datanode Implementation Requirements
Datanode must implement enhanced query processing capabilities to support sequence-based incremental reads:
**Input Processing:**
- Accept `memtable_last_seq` and `sst_last_seq` parameters in query requests
- Filter data based on sequence ranges across both memtable and SST storage layers
**Output Enhancement:**
```rust
struct OutputMeta {
    pub plan: Option<Arc<dyn ExecutionPlan>>,
    pub cost: OutputCost,
    // New field for sequence tracking per region involved in the query
    pub sequence_info: HashMap<RegionId, SequenceInfo>,
}

struct SequenceInfo {
    // Sequence tracking for next iteration
    max_memtable_seq: SequenceNumber, // Highest sequence from memtable in this result
    max_sst_seq: SequenceNumber,      // Highest sequence from SST in this result
}
```
**Sequence Tracking Logic:**
The datanode already implements `max_sst_seq` tracking for leader range reads; similar logic can be reused for `max_memtable_seq`.
#### Sequence Update Strategy
**Normal Incremental Updates:**
- Update both `memtable_last_seq` and `sst_last_seq` after successful query execution
- Use returned `max_memtable_seq` and `max_sst_seq` values for the next iteration (see the sketch after this list)
**Refill Scenario:**
- Reset `memtable_last_seq` to current maximum after refill completion
- Continue normal `sst_last_seq` updates based on successful query responses
- Maintain separate tracking to detect future flush events
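A small sketch of the bookkeeping described above, reusing the `SequenceInfo` shape from earlier. The `FlowSequenceState` struct and its methods are hypothetical; they only illustrate the normal-update and refill-reset paths.
```rust
use std::collections::HashMap;

type RegionId = u64;
type SequenceNumber = u64;

// Shape mirrors the per-region SequenceInfo returned with query results.
#[derive(Clone, Copy)]
struct SequenceInfo {
    max_memtable_seq: SequenceNumber,
    max_sst_seq: SequenceNumber,
}

// Hypothetical per-flow tracking state kept by the flownode.
#[derive(Default)]
struct FlowSequenceState {
    memtable_last_seq: HashMap<RegionId, SequenceNumber>,
    sst_last_seq: HashMap<RegionId, SequenceNumber>,
}

impl FlowSequenceState {
    /// Normal incremental update: advance both sequences after a successful query.
    fn apply(&mut self, infos: &HashMap<RegionId, SequenceInfo>) {
        for (region, info) in infos {
            self.memtable_last_seq.insert(*region, info.max_memtable_seq);
            self.sst_last_seq.insert(*region, info.max_sst_seq);
        }
    }

    /// Refill scenario: reset memtable tracking to the current maximum while
    /// sst_last_seq keeps advancing through normal updates.
    fn reset_memtable_after_refill(&mut self, region: RegionId, current_max: SequenceNumber) {
        self.memtable_last_seq.insert(region, current_max);
    }
}

fn main() {
    let mut state = FlowSequenceState::default();
    let mut infos = HashMap::new();
    infos.insert(1, SequenceInfo { max_memtable_seq: 120, max_sst_seq: 100 });
    state.apply(&infos);
    state.reset_memtable_after_refill(1, 150);
    assert_eq!(state.memtable_last_seq[&1], 150);
    assert_eq!(state.sst_last_seq[&1], 100);
}
```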
#### Performance Considerations
**Sequence Range Optimization:**
- Minimize sequence range spans to reduce scan overhead
- Batch multiple small incremental updates when beneficial
- Balance between query frequency and processing efficiency
**Memory Management:**
- Monitor memtable flush frequency to predict refill requirements
- Implement adaptive query scheduling based on flush patterns
- Optimize state storage to handle frequent updates efficiently
This sequential read implementation ensures reliable incremental processing while gracefully handling the complexities of storage architecture, maintaining both correctness and performance in the face of background compaction and flush operations.
## Implementation Plan
### Phase 1: Core Infrastructure
1. **State Management**: Implement in-memory state map in flownode
2. **Query Interface**: Integrate `__aggr_state` query interface in the embedded frontend (already done in previous query pushdown optimizer work)
3. **Basic Coordination**: Implement query dispatch and result collection
4. **Sequence Tracking**: Implement sequence-based incremental processing (can use a similar interface to the one leader range read uses)
After phase 1, the system should support basic flow operations with incremental updates.
### Phase 2: Optimization Features
1. **Refill Logic**: Develop state recovery mechanisms
- **Datanode Optimization**: Optimize result writing from flownode
- **Metasrv Coordination**: Enhanced metadata management and coordination
## Conclusion
The laminar Flow architecture represents a significant improvement over the current flow system by separating state management from computation execution. This design enables better resource utilization, improved scalability, and simplified maintenance while maintaining the core functionality of continuous aggregation.
The key benefits include:
1. **Improved Scalability**: Computation can scale independently of state management
While the architecture introduces some complexity in terms of distributed coordination and error handling, the benefits significantly outweigh the drawbacks, making it a compelling evolution of the flow system.
| Title | Query | Type | Description | Datasource | Unit | Legend Format |
| --- | --- | --- | --- | --- | --- | --- |
| Datanode Memory per Instance | `sum(process_resident_memory_bytes{instance=~"$datanode"}) by (instance, pod)`<br/>`max(greptime_memory_limit_in_bytes{app="greptime-datanode"})` | `timeseries` | Current memory usage by instance | `prometheus` | `bytes` | `[{{instance}}]-[{{ pod }}]` |
| Datanode CPU Usage per Instance | `sum(rate(process_cpu_seconds_total{instance=~"$datanode"}[$__rate_interval]) * 1000) by (instance, pod)`<br/>`max(greptime_cpu_limit_in_millicores{app="greptime-datanode"})` | `timeseries` | Current cpu usage by instance | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]` |
| Frontend Memory per Instance | `sum(process_resident_memory_bytes{instance=~"$frontend"}) by (instance, pod)`<br/>`max(greptime_memory_limit_in_bytes{app="greptime-frontend"})` | `timeseries` | Current memory usage by instance | `prometheus` | `bytes` | `[{{ instance }}]-[{{ pod }}]` |
| Frontend CPU Usage per Instance | `sum(rate(process_cpu_seconds_total{instance=~"$frontend"}[$__rate_interval]) * 1000) by (instance, pod)`<br/>`max(greptime_cpu_limit_in_millicores{app="greptime-frontend"})` | `timeseries` | Current cpu usage by instance | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]-cpu` |
| Metasrv Memory per Instance | `sum(process_resident_memory_bytes{instance=~"$metasrv"}) by (instance, pod)`<br/>`max(greptime_memory_limit_in_bytes{app="greptime-metasrv"})` | `timeseries` | Current memory usage by instance | `prometheus` | `bytes` | `[{{ instance }}]-[{{ pod }}]-resident` |
| Metasrv CPU Usage per Instance | `sum(rate(process_cpu_seconds_total{instance=~"$metasrv"}[$__rate_interval]) * 1000) by (instance, pod)`<br/>`max(greptime_cpu_limit_in_millicores{app="greptime-metasrv"})` | `timeseries` | Current cpu usage by instance | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]` |
| Flownode Memory per Instance | `sum(process_resident_memory_bytes{instance=~"$flownode"}) by (instance, pod)`<br/>`max(greptime_memory_limit_in_bytes{app="greptime-flownode"})` | `timeseries` | Current memory usage by instance | `prometheus` | `bytes` | `[{{ instance }}]-[{{ pod }}]` |
| Flownode CPU Usage per Instance | `sum(rate(process_cpu_seconds_total{instance=~"$flownode"}[$__rate_interval]) * 1000) by (instance, pod)`<br/>`max(greptime_cpu_limit_in_millicores{app="greptime-flownode"})` | `timeseries` | Current cpu usage by instance | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]` |
| Datanode Memory per Instance | `sum(process_resident_memory_bytes{instance=~"$datanode"}) by (instance, pod)`<br/>`max(greptime_memory_limit_in_bytes{instance=~"$datanode"})` | `timeseries` | Current memory usage by instance | `prometheus` | `bytes` | `[{{instance}}]-[{{ pod }}]` |
| Datanode CPU Usage per Instance | `sum(rate(process_cpu_seconds_total{instance=~"$datanode"}[$__rate_interval]) * 1000) by (instance, pod)`<br/>`max(greptime_cpu_limit_in_millicores{instance=~"$datanode"})` | `timeseries` | Current cpu usage by instance | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]` |
| Frontend Memory per Instance | `sum(process_resident_memory_bytes{instance=~"$frontend"}) by (instance, pod)`<br/>`max(greptime_memory_limit_in_bytes{instance=~"$frontend"})` | `timeseries` | Current memory usage by instance | `prometheus` | `bytes` | `[{{ instance }}]-[{{ pod }}]` |
| Frontend CPU Usage per Instance | `sum(rate(process_cpu_seconds_total{instance=~"$frontend"}[$__rate_interval]) * 1000) by (instance, pod)`<br/>`max(greptime_cpu_limit_in_millicores{instance=~"$frontend"})` | `timeseries` | Current cpu usage by instance | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]-cpu` |
| Metasrv Memory per Instance | `sum(process_resident_memory_bytes{instance=~"$metasrv"}) by (instance, pod)`<br/>`max(greptime_memory_limit_in_bytes{instance=~"$metasrv"})` | `timeseries` | Current memory usage by instance | `prometheus` | `bytes` | `[{{ instance }}]-[{{ pod }}]-resident` |
| Metasrv CPU Usage per Instance | `sum(rate(process_cpu_seconds_total{instance=~"$metasrv"}[$__rate_interval]) * 1000) by (instance, pod)`<br/>`max(greptime_cpu_limit_in_millicores{instance=~"$metasrv"})` | `timeseries` | Current cpu usage by instance | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]` |
| Flownode Memory per Instance | `sum(process_resident_memory_bytes{instance=~"$flownode"}) by (instance, pod)`<br/>`max(greptime_memory_limit_in_bytes{instance=~"$flownode"})` | `timeseries` | Current memory usage by instance | `prometheus` | `bytes` | `[{{ instance }}]-[{{ pod }}]` |
| Flownode CPU Usage per Instance | `sum(rate(process_cpu_seconds_total{instance=~"$flownode"}[$__rate_interval]) * 1000) by (instance, pod)`<br/>`max(greptime_cpu_limit_in_millicores{instance=~"$flownode"})` | `timeseries` | Current cpu usage by instance | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]` |
# Frontend Requests
| Title | Query | Type | Description | Datasource | Unit | Legend Format |
| --- | --- | --- | --- | --- | --- | --- |
@@ -87,6 +87,13 @@
| Other Request P99 per Instance | `histogram_quantile(0.99, sum by(instance, pod, le, scheme, operation) (rate(opendal_operation_duration_seconds_bucket{instance=~"$datanode", operation!~"read\|write\|list\|Writer::write\|Writer::close\|Reader::read"}[$__rate_interval])))` | `timeseries` | Other Request P99 per Instance. | `prometheus` | `s` | `[{{instance}}]-[{{pod}}]-[{{scheme}}]-[{{operation}}]` |
| Opendal traffic | `sum by(instance, pod, scheme, operation) (rate(opendal_operation_bytes_sum{instance=~"$datanode"}[$__rate_interval]))` | `timeseries` | Total traffic as in bytes by instance and operation | `prometheus` | `decbytes` | `[{{instance}}]-[{{pod}}]-[{{scheme}}]-[{{operation}}]` |
| Title | Query | Type | Description | Datasource | Unit | Legend Format |
| --- | --- | --- | --- | --- | --- | --- |
@@ -103,6 +110,8 @@
| Meta KV Ops Latency | `histogram_quantile(0.99, sum by(pod, le, op, target) (greptime_meta_kv_request_elapsed_bucket))` | `timeseries` | P99 latency of metasrv KV operations, by operation and target. | `prometheus` | `s` | `{{pod}}-{{op}} p99` |
| Rate of meta KV Ops | `rate(greptime_meta_kv_request_elapsed_count[$__rate_interval])` | `timeseries` | Rate of metasrv KV operations, by operation. | `prometheus` | `none` | `{{pod}}-{{op}} p99` |
| DDL Latency | `histogram_quantile(0.9, sum by(le, pod, step) (greptime_meta_procedure_create_tables_bucket))`<br/>`histogram_quantile(0.9, sum by(le, pod, step) (greptime_meta_procedure_create_table))`<br/>`histogram_quantile(0.9, sum by(le, pod, step) (greptime_meta_procedure_create_view))`<br/>`histogram_quantile(0.9, sum by(le, pod, step) (greptime_meta_procedure_create_flow))`<br/>`histogram_quantile(0.9, sum by(le, pod, step) (greptime_meta_procedure_drop_table))`<br/>`histogram_quantile(0.9, sum by(le, pod, step) (greptime_meta_procedure_alter_table))` | `timeseries` | P90 latency of DDL procedures, by procedure and step. | `prometheus` | `s` | `CreateLogicalTables-{{step}} p90` |
| Title | Query | Type | Description | Datasource | Unit | Legend Format |
| --- | --- | --- | --- | --- | --- | --- |
| Datanode Memory per Instance | `sum(process_resident_memory_bytes{}) by (instance, pod)`<br/>`max(greptime_memory_limit_in_bytes{app="greptime-datanode"})` | `timeseries` | Current memory usage by instance | `prometheus` | `bytes` | `[{{instance}}]-[{{ pod }}]` |
| Datanode CPU Usage per Instance | `sum(rate(process_cpu_seconds_total{}[$__rate_interval]) * 1000) by (instance, pod)`<br/>`max(greptime_cpu_limit_in_millicores{app="greptime-datanode"})` | `timeseries` | Current cpu usage by instance | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]` |
| Frontend Memory per Instance | `sum(process_resident_memory_bytes{}) by (instance, pod)`<br/>`max(greptime_memory_limit_in_bytes{app="greptime-frontend"})` | `timeseries` | Current memory usage by instance | `prometheus` | `bytes` | `[{{ instance }}]-[{{ pod }}]` |
| Frontend CPU Usage per Instance | `sum(rate(process_cpu_seconds_total{}[$__rate_interval]) * 1000) by (instance, pod)`<br/>`max(greptime_cpu_limit_in_millicores{app="greptime-frontend"})` | `timeseries` | Current cpu usage by instance | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]-cpu` |
| Metasrv Memory per Instance | `sum(process_resident_memory_bytes{}) by (instance, pod)`<br/>`max(greptime_memory_limit_in_bytes{app="greptime-metasrv"})` | `timeseries` | Current memory usage by instance | `prometheus` | `bytes` | `[{{ instance }}]-[{{ pod }}]-resident` |
| Metasrv CPU Usage per Instance | `sum(rate(process_cpu_seconds_total{}[$__rate_interval]) * 1000) by (instance, pod)`<br/>`max(greptime_cpu_limit_in_millicores{app="greptime-metasrv"})` | `timeseries` | Current cpu usage by instance | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]` |
| Flownode Memory per Instance | `sum(process_resident_memory_bytes{}) by (instance, pod)`<br/>`max(greptime_memory_limit_in_bytes{app="greptime-flownode"})` | `timeseries` | Current memory usage by instance | `prometheus` | `bytes` | `[{{ instance }}]-[{{ pod }}]` |
| Flownode CPU Usage per Instance | `sum(rate(process_cpu_seconds_total{}[$__rate_interval]) * 1000) by (instance, pod)`<br/>`max(greptime_cpu_limit_in_millicores{app="greptime-flownode"})` | `timeseries` | Current cpu usage by instance | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]` |
| Datanode Memory per Instance | `sum(process_resident_memory_bytes{}) by (instance, pod)`<br/>`max(greptime_memory_limit_in_bytes{})` | `timeseries` | Current memory usage by instance | `prometheus` | `bytes` | `[{{instance}}]-[{{ pod }}]` |
| Datanode CPU Usage per Instance | `sum(rate(process_cpu_seconds_total{}[$__rate_interval]) * 1000) by (instance, pod)`<br/>`max(greptime_cpu_limit_in_millicores{})` | `timeseries` | Current cpu usage by instance | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]` |
| Frontend Memory per Instance | `sum(process_resident_memory_bytes{}) by (instance, pod)`<br/>`max(greptime_memory_limit_in_bytes{})` | `timeseries` | Current memory usage by instance | `prometheus` | `bytes` | `[{{ instance }}]-[{{ pod }}]` |
| Frontend CPU Usage per Instance | `sum(rate(process_cpu_seconds_total{}[$__rate_interval]) * 1000) by (instance, pod)`<br/>`max(greptime_cpu_limit_in_millicores{})` | `timeseries` | Current cpu usage by instance | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]-cpu` |
| Metasrv Memory per Instance | `sum(process_resident_memory_bytes{}) by (instance, pod)`<br/>`max(greptime_memory_limit_in_bytes{})` | `timeseries` | Current memory usage by instance | `prometheus` | `bytes` | `[{{ instance }}]-[{{ pod }}]-resident` |
| Metasrv CPU Usage per Instance | `sum(rate(process_cpu_seconds_total{}[$__rate_interval]) * 1000) by (instance, pod)`<br/>`max(greptime_cpu_limit_in_millicores{})` | `timeseries` | Current cpu usage by instance | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]` |
| Flownode Memory per Instance | `sum(process_resident_memory_bytes{}) by (instance, pod)`<br/>`max(greptime_memory_limit_in_bytes{})` | `timeseries` | Current memory usage by instance | `prometheus` | `bytes` | `[{{ instance }}]-[{{ pod }}]` |
| Flownode CPU Usage per Instance | `sum(rate(process_cpu_seconds_total{}[$__rate_interval]) * 1000) by (instance, pod)`<br/>`max(greptime_cpu_limit_in_millicores{})` | `timeseries` | Current cpu usage by instance | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]` |
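In the panels directly above, the label selectors are left empty (no `app` or `instance` filter), in contrast to the filtered cluster panels earlier. The CPU queries multiply `rate(process_cpu_seconds_total[...])` by 1000 so usage is expressed in millicores and is directly comparable with `greptime_cpu_limit_in_millicores`. A hedged sketch of the same comparison as a utilization percentage (again, not a panel in this dashboard):

```promql
# Illustrative only: CPU utilization as a percentage of the configured millicore limit.
100 * sum(rate(process_cpu_seconds_total{}[$__rate_interval]) * 1000) by (instance, pod)
  / scalar(max(greptime_cpu_limit_in_millicores{}))
```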
# Frontend Requests
| Title | Query | Type | Description | Datasource | Unit | Legend Format |
| --- | --- | --- | --- | --- | --- | --- |
@@ -87,6 +87,13 @@
| Other Request P99 per Instance | `histogram_quantile(0.99, sum by(instance, pod, le, scheme, operation) (rate(opendal_operation_duration_seconds_bucket{ operation!~"read\|write\|list\|Writer::write\|Writer::close\|Reader::read"}[$__rate_interval])))` | `timeseries` | P99 latency of OpenDAL operations other than read/write/list, per instance | `prometheus` | `s` | `[{{instance}}]-[{{pod}}]-[{{scheme}}]-[{{operation}}]` |
| Opendal traffic | `sum by(instance, pod, scheme, operation) (rate(opendal_operation_bytes_sum{}[$__rate_interval]))` | `timeseries` | Total traffic in bytes by instance and operation | `prometheus` | `decbytes` | `[{{instance}}]-[{{pod}}]-[{{scheme}}]-[{{operation}}]` |
| Title | Query | Type | Description | Datasource | Unit | Legend Format |
| --- | --- | --- | --- | --- | --- | --- |
@@ -103,6 +110,8 @@
| Meta KV Ops Latency | `histogram_quantile(0.99, sum by(pod, le, op, target) (greptime_meta_kv_request_elapsed_bucket))` | `timeseries` | P99 latency of metasrv KV backend requests, by operation and target | `prometheus` | `s` | `{{pod}}-{{op}} p99` |
| Rate of meta KV Ops | `rate(greptime_meta_kv_request_elapsed_count[$__rate_interval])` | `timeseries` | Rate of metasrv KV backend requests | `prometheus` | `none` | `{{pod}}-{{op}} p99` |
| DDL Latency | `histogram_quantile(0.9, sum by(le, pod, step) (greptime_meta_procedure_create_tables_bucket))`<br/>`histogram_quantile(0.9, sum by(le, pod, step) (greptime_meta_procedure_create_table))`<br/>`histogram_quantile(0.9, sum by(le, pod, step) (greptime_meta_procedure_create_view))`<br/>`histogram_quantile(0.9, sum by(le, pod, step) (greptime_meta_procedure_create_flow))`<br/>`histogram_quantile(0.9, sum by(le, pod, step) (greptime_meta_procedure_drop_table))`<br/>`histogram_quantile(0.9, sum by(le, pod, step) (greptime_meta_procedure_alter_table))` | `timeseries` | P90 latency of metasrv DDL procedures (create/drop/alter table, view, and flow), by procedure step | `prometheus` | `s` | `CreateLogicalTables-{{step}} p90` |