* feat(mysql): add SHOW WARNINGS support and return warnings for unsupported SET variables
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* feat(function): add MySQL IF() function and PostgreSQL description functions for connector compatibility
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix: show tables for mysql
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix: partitions table in information_schema and add starrocks external catalog compatibility
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* refactor: async udf
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix: set warnings
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* feat: impl pg_my_temp_schema and make description functions simple
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* test: add test for issue 7313
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* feat: apply suggestions
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix: partition_expression and partition_description
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix: test
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix: unit tests
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix: search_path only works for pg
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* feat: improve warnings processing
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix: warnings while writing affected rows and refactor
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* chore: improve ShobjDescriptionFunction signature
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* refactor: array_to_boolean
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
---------
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* chore: return meaningful message when content type mismatch in otel
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* refactor: extract duplicated code
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* chore: use a new error for failing to decode loki request
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
---------
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* feat: gc ctx&procedure
Signed-off-by: discord9 <discord9@163.com>
* fix: handle region not found case
Signed-off-by: discord9 <discord9@163.com>
* docs: more explanation & todo
Signed-off-by: discord9 <discord9@163.com>
* per review
Signed-off-by: discord9 <discord9@163.com>
* chore: add time for region gc
Signed-off-by: discord9 <discord9@163.com>
* fix: explain why loader for gc region should fail
Signed-off-by: discord9 <discord9@163.com>
---------
Signed-off-by: discord9 <discord9@163.com>
* feat: split batches by rule in build_flat_sources()
It checks the num_series and splits batches when the series cardinality
is low
Signed-off-by: evenyag <realevenyag@gmail.com>
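A minimal sketch of the splitting rule described above, assuming `num_series` comes from range statistics; the threshold and function names are illustrative, not the actual mito2 code:

```rust
/// Decide whether a batch should be split: only when the series count is
/// known and the cardinality is low, so each chunk still covers few series.
fn should_split(num_rows: usize, num_series: Option<usize>) -> bool {
    const MIN_ROWS_PER_SERIES: usize = 1024; // illustrative threshold
    match num_series {
        Some(n) if n > 0 => num_rows / n >= MIN_ROWS_PER_SERIES,
        // No num_series available: keep the batch whole instead of panicking
        // (this is the case the follow-up fix above handles).
        _ => false,
    }
}

/// Split a batch of rows into fixed-size chunks.
fn split_batch(rows: Vec<u64>, chunk_size: usize) -> Vec<Vec<u64>> {
    rows.chunks(chunk_size).map(|c| c.to_vec()).collect()
}

fn main() {
    assert!(should_split(1_000_000, Some(10)));
    assert!(!should_split(1_000_000, None));
    assert_eq!(split_batch((0..10).collect(), 4).len(), 3);
}
```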
* fix: panic when no num_series available
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: don't subtract file index if checking mem range
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: update comments and control flow
Signed-off-by: evenyag <realevenyag@gmail.com>
* style: fix clippy
Signed-off-by: evenyag <realevenyag@gmail.com>
---------
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: divide parquet and puffin index
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: download index files when we open the region
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: use different label for parquet/puffin
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: control parallelism and cache size by env
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: change gauge to counter
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: correct file type labels in file cache
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: move env to config and change cache ratio to percent
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: checks capacity before download and refine metrics
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: change open to return MitoRegionRef
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: extract download to FileCache
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: run load cache task in write cache
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: check region state before downloading files
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: update config docs and test
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: use file id from index_file_id to compute puffin key
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: skip loading cache in some states
Signed-off-by: evenyag <realevenyag@gmail.com>
---------
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix/region-expire-state:
Refactor region state handling in compaction task and manifest updates
- Introduce a variable to hold the current region state for clarity in compaction task updates.
- Add an expected_region_state field to RegionEditResult to manage region state expectations during manifest handling.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* fix/region-expire-state:
Refactor region state handling in compaction task
- Replace direct assignment of `RegionLeaderState::Writable` with dynamic state retrieval and conditional check for leader state.
- Modify `RegionEditResult` to include a flag `update_region_state` instead of `expected_region_state` to indicate if the region state should be updated to writable.
- Adjust handling of `RegionEditResult` in `handle_manifest` to conditionally update region state based on the new flag.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat/allow-fuzz-input-override:
Add environment override for fuzzing parameters and seed values
- Implement `get_fuzz_override` function to read override values from environment variables for fuzzing parameters.
- Allow overriding `SEED`, `ACTIONS`, `ROWS`, `TABLES`, `COLUMNS`, `INSERTS`, and `PARTITIONS` in various fuzzing targets.
- Introduce new constants `GT_FUZZ_INPUT_MAX_PARTITIONS` and `FUZZ_OVERRIDE_PREFIX`.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
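A minimal sketch of the override helper, assuming the `GT_FUZZ_INPUT_` prefix implied by the constants above; the real parsing lives in the fuzz test utilities:

```rust
use std::env;
use std::str::FromStr;

// Assumed prefix, following FUZZ_OVERRIDE_PREFIX from the commit above.
const FUZZ_OVERRIDE_PREFIX: &str = "GT_FUZZ_INPUT_";

/// Read an override such as GT_FUZZ_INPUT_SEED from the environment,
/// returning None when the variable is unset or fails to parse.
fn get_fuzz_override<T: FromStr>(name: &str) -> Option<T> {
    env::var(format!("{FUZZ_OVERRIDE_PREFIX}{name}"))
        .ok()
        .and_then(|v| v.parse::<T>().ok())
}

fn main() {
    let generated_seed: u64 = 42; // value the fuzzer would otherwise generate
    let seed = get_fuzz_override::<u64>("SEED").unwrap_or(generated_seed);
    println!("using seed {seed}");
}
```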
* feat/allow-fuzz-input-override: Remove GT_FUZZ_INPUT_MAX_PARTITIONS constant and usage from fuzzing utils and tests
• Deleted the GT_FUZZ_INPUT_MAX_PARTITIONS constant from fuzzing utility functions.
• Updated FuzzInput struct in fuzz_migrate_mito_regions.rs to use a hardcoded range instead of get_gt_fuzz_input_max_partitions for determining the number of partitions.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat/allow-fuzz-input-override:
Improve fuzzing documentation with environment variable overrides
Enhanced the fuzzing instructions in the README to include guidance on how to override fuzz input using environment variables, providing an example for better clarity.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat(expr): support vec_elem_avg function
Signed-off-by: Alan Tang <jmtangcs@gmail.com>
* feat: support vec_avg function
Signed-off-by: Alan Tang <jmtangcs@gmail.com>
* test: add more query test for avg aggregator
Signed-off-by: Alan Tang <jmtangcs@gmail.com>
* fix: fix the merge batch mode
Signed-off-by: Alan Tang <jmtangcs@gmail.com>
* refactor: use sum and count as state for avg function
Signed-off-by: Alan Tang <jmtangcs@gmail.com>
* refactor: refactor merge batch mode for avg function
Signed-off-by: Alan Tang <jmtangcs@gmail.com>
* feat: add additional vector restrictions for validation
Signed-off-by: Alan Tang <jmtangcs@gmail.com>
---------
Signed-off-by: Alan Tang <jmtangcs@gmail.com>
Co-authored-by: Yingwen <realevenyag@gmail.com>
* fix/pick-continue:
### Add Tests for TWCS Compaction Logic
- **`twcs.rs`**:
- Modified the logic in `TwcsPicker` to handle cases with zero runs by using `continue` instead of `return`.
- Added two new test cases: `test_build_output_multiple_windows_with_zero_runs` and `test_build_output_single_window_zero_runs` to verify the behavior of the compaction logic when there are zero runs in
the windows.
- **`memtable_util.rs`**:
- Removed unused import `PredicateGroup`.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* fix: clippy
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* fix/pick-continue:
Enhance Compaction Process with Expired SST Handling and Testing
- **`compactor.rs`**:
- Introduced handling for expired SSTs by updating the manifest immediately upon task completion.
- Added new test cases to verify the handling of expired SSTs and manifest updates.
- **`task.rs`**:
- Implemented `remove_expired` function to handle expired SSTs by updating the manifest and notifying the region worker loop.
- Refactored `handle_compaction` to `handle_expiration_and_compaction` to integrate expired SST removal before merging inputs.
- Added logging and error handling for expired SST removal process.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* refactor/progressive-compaction:
**Enhance Compaction Task Error Handling**
- Updated `task.rs` to conditionally execute the removal of expired SST files only when they exist, improving error handling and performance.
- Added a check for non-empty `expired_ssts` before initiating the removal process, ensuring unnecessary operations are avoided.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* refactor/progressive-compaction:
### Refactor `DefaultCompactor` to Extract `merge_single_output` Method
- **File**: `src/mito2/src/compaction/compactor.rs`
- Extracted the logic for merging a single compaction output into SST files into a new method `merge_single_output` within the `DefaultCompactor` struct.
- Simplified the `merge_ssts` method by utilizing the new `merge_single_output` method, reducing code duplication and improving maintainability.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* refactor/progressive-compaction:
### Add Max Background Compaction Tasks Configuration
- **`compaction.rs`**: Added `max_background_compactions` to the compaction scheduler to limit background tasks.
- **`compaction/compactor.rs`**: Removed immediate manifest update logic after task completion.
- **`compaction/picker.rs`**: Introduced `max_background_tasks` parameter in `new_picker` to control task limits.
- **`compaction/twcs.rs`**: Updated `TwcsPicker` to include `max_background_tasks` and truncate inputs exceeding this limit. Added related test cases to ensure functionality.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* fix/pick-continue:
### Improve Error Handling and Task Management in Compaction
- **`task.rs`**: Enhanced error handling in `remove_expired` function by logging errors without halting the compaction process. Removed the return of `Result` type and added detailed logging for various
failure scenarios.
- **`twcs.rs`**: Adjusted task management logic by removing input truncation based on `max_background_tasks` and instead discarding remaining tasks if the output size exceeds the limit. This ensures better
control over task execution and resource management.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* fix/pick-continue:
### Add Unit Tests for Compaction Task and TWCS Picker
- **`task.rs`**: Added unit tests to verify the behavior of `PickerOutput` with and without expired SSTs.
- **`twcs.rs`**: Introduced tests for `TwcsPicker` to ensure correct handling of `max_background_tasks` during compaction, including scenarios with and without task truncation.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* fix/pick-continue:
**Improve Error Handling and Notification in Compaction Task**
- **File:** `task.rs`
- Changed log level from `warn` to `error` for manifest update failures to enhance error visibility.
- Refactored the notification mechanism for expired file removal by using `BackgroundNotify::RegionEdit` with `RegionEditResult` to streamline the process.
- Simplified error handling by consolidating match cases into a single `if let Err` block for better readability and maintainability.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* test: add tests for scanning append mode before flush
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: extract a function maybe_dedup_one
Signed-off-by: evenyag <realevenyag@gmail.com>
* ci: add flat format to docs.yml so we can make it required later
Signed-off-by: evenyag <realevenyag@gmail.com>
---------
Signed-off-by: evenyag <realevenyag@gmail.com>
* mito2: add unit test for flat single-range append_mode dedup behavior
Verify memtable_flat_sources skips dedup when append_mode is true and
performs dedup otherwise for single-range flat memtables, preventing
regressions in the new append_mode path.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* fix/flat-source-merge:
### Improve Column Metadata Extraction Logic
- **File**: `src/common/meta/src/ddl/utils.rs`
- Modified the `extract_column_metadatas` function to use `swap_remove` for extracting the first schema and decode column metadata for comparison instead of raw bytes. This ensures that the extension map is considered during
verification, enhancing the robustness of metadata consistency checks across datanodes.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat: add arrow json extension type
* feat: add json structure settings to extension type
* refactor: store json structure settings as extension metadata
* chore: make binary an acceptable type for extension
* chore/add-region-insert-failure-metric: Add metric for failed insert requests to region server in datanode module
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* chore/add-region-insert-failure-metric:
Add metric for tracking failed region server requests
- Introduce a new metric `REGION_SERVER_REQUEST_FAILURE_COUNT` to count failed region server requests.
- Update `REGION_SERVER_INSERT_FAIL_COUNT` metric description for consistency.
- Implement error handling in `RegionServerHandler` to increment the new failure metric on request errors.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* fix: potential failure in the test_index_build_type_compact test
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix: relax timestamp checking in test_timestamp_default_now
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
---------
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* feat/objbench-subcmd:
### Add Object Storage Benchmark Tool and Update Dependencies
- **`Cargo.lock` & `Cargo.toml`**: Added dependencies for `colored`, `parquet`, and `pprof` to support new features.
- **`datanode.rs`**: Introduced `ObjbenchCommand` for benchmarking object storage, including command-line options for configuration and execution. Added `StorageConfig` and `StorageConfigWrapper` for storage engine configuration.
- **`datanode.rs`**: Implemented a stub for `build_object_store` function to initialize object storage.
These changes introduce a new subcommand for object storage benchmarking and update dependencies to support additional functionality.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* init
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* fix: code style and clippy
* feat/objbench-subcmd:
Improve error handling in `objbench.rs`
- Enhanced error handling in `parse_config` and `parse_file_dir_components` functions by replacing `unwrap` with `OptionExt` and `context` for better error messages.
- Updated `build_access_layer_simple` and `build_cache_manager` functions to use `map_err` for more descriptive error handling.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* chore: rebase main
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* again
* false by default
* test: config api
* refactor: per code review
* less info!
* even less info!!
* docs: gc regions instr
* refactor: group by region id
* per code review
* per review
* error handling?
* test: fix
* todos
* fix after rebase
* after refactor
Signed-off-by: discord9 <discord9@163.com>
* feat/gdump:
### Add Support for Jemalloc Gdump Flag
- **`jemalloc.rs`**: Introduced `PROF_GDUMP` constant and added functions `set_gdump_active` and `is_gdump_active` to manage the gdump flag.
- **`error.rs`**: Added error handling for reading and updating the jemalloc gdump flag with `ReadGdump` and `UpdateGdump` errors.
- **`lib.rs`**: Exposed `is_gdump_active` and `set_gdump_active` functions for non-Windows platforms.
- **`http.rs`**: Added HTTP routes for checking and toggling the jemalloc gdump flag status.
- **`mem_prof.rs`**: Implemented handlers `gdump_toggle_handler` and `gdump_status_handler` for managing gdump flag via HTTP requests.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
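For reference, toggling jemalloc's gdump flag goes through the `prof.gdump` mallctl key. A hedged sketch assuming the tikv-jemalloc-ctl crate is available; the real functions wrap these calls with richer error context:

```rust
// mallctl names are NUL-terminated byte strings.
const PROF_GDUMP: &[u8] = b"prof.gdump\0";

/// Read the current gdump flag.
pub fn is_gdump_active() -> Result<bool, tikv_jemalloc_ctl::Error> {
    // Safety: prof.gdump is a bool-typed mallctl key.
    unsafe { tikv_jemalloc_ctl::raw::read::<bool>(PROF_GDUMP) }
}

/// Set the gdump flag, returning its previous value.
pub fn set_gdump_active(active: bool) -> Result<bool, tikv_jemalloc_ctl::Error> {
    unsafe { tikv_jemalloc_ctl::raw::update::<bool>(PROF_GDUMP, active) }
}
```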
* Update docs/how-to/how-to-profile-memory.md
Co-authored-by: shuiyisong <113876041+shuiyisong@users.noreply.github.com>
* fix: typo in docs
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
Co-authored-by: shuiyisong <113876041+shuiyisong@users.noreply.github.com>
* feat: add format, regex_extract functions and more type tests
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix: add forgotten functions
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* chore: add forgotten null type
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* test: add forgotten date type
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* feat: remove format function
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* test: update results after upgrading datafusion
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
---------
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* perf: only decode primary keys in the batch
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: don't push none to creator
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: implement method to filter __table_id for sparse encoding
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: filter table id for sparse encoding separately
The __table_id is not present in the projection, so we have to filter it
manually
Signed-off-by: evenyag <realevenyag@gmail.com>
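A hedged illustration of that manual filtering step; the reserved constant echoes `RESERVED_COLUMN_ID_TABLE_ID` mentioned later in this log, but the value and the decoded representation here are illustrative:

```rust
// Illustrative reserved id for the internal __table_id column.
const RESERVED_COLUMN_ID_TABLE_ID: u32 = u32::MAX;

/// Drop the internal __table_id entry from decoded (column_id, value) pairs,
/// since it never appears in the user-visible projection.
fn filter_table_id(decoded: Vec<(u32, String)>) -> Vec<(u32, String)> {
    decoded
        .into_iter()
        .filter(|(id, _)| *id != RESERVED_COLUMN_ID_TABLE_ID)
        .collect()
}

fn main() {
    let decoded = vec![
        (RESERVED_COLUMN_ID_TABLE_ID, "1024".to_string()),
        (7, "host-a".to_string()),
    ];
    assert_eq!(filter_table_id(decoded), vec![(7, "host-a".to_string())]);
}
```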
* fix: decode tags for sparse encoding when building bloom filter
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: support inverted index for tags under sparse encoding
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: skip tag columns in fulltext index
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: fix warnings
Signed-off-by: evenyag <realevenyag@gmail.com>
* style: fix clippy
Signed-off-by: evenyag <realevenyag@gmail.com>
* test: fix list index metadata test
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: decode primary key columns to filter
When primary key columns are not in the projection but appear in filters,
we need to decode them in compute_filter_mask_flat
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: reuse filter method
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: only use dictionary for string type in compat
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: safe to get column by creator's column id
Signed-off-by: evenyag <realevenyag@gmail.com>
---------
Signed-off-by: evenyag <realevenyag@gmail.com>
* test: support flat in basic_test
Signed-off-by: evenyag <realevenyag@gmail.com>
* test: support flat in alter_test
Signed-off-by: evenyag <realevenyag@gmail.com>
* test: support flat for append_mode_test
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update bump_committed_sequence_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update close_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update compaction_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update create_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update edit_region_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update merge_mode_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update parallel_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update projection_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update prune_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update row_selector_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update scan_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update drop_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update filter_deleted_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update sync_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update set_role_state_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update staging_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update truncate_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update catchup_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update flush_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update open_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: update batch_open_test to test both formats
Signed-off-by: evenyag <realevenyag@gmail.com>
* test: fix all flat format tests
Signed-off-by: evenyag <realevenyag@gmail.com>
---------
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat/manual-compaction-parallelism:
### Add Parallelism Support to Compaction Requests
- **`Cargo.lock` & `Cargo.toml`**: Updated `greptime-proto` dependency to a new revision.
- **`flush_compact_table.rs`**: Enhanced `parse_compact_params` to support a new `parallelism` parameter, allowing users to
specify the level of parallelism for table compaction.
- **`handle_compaction.rs`**: Integrated `parallelism` into the compaction scheduling process, defaulting to 1 if not
specified.
- **`request.rs` & `region_request.rs`**: Modified `CompactRequest` to include `parallelism`, with logic to handle unspecified values.
- **`requests.rs`**: Updated `CompactTableRequest` structure to include an optional `parallelism` field.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat/manual-compaction-parallelism:
Enhance Compaction Request Handling
- **`flush_compact_table.rs`**:
- Renamed `parse_compact_params` to `parse_compact_request`.
- Introduced `DEFAULT_COMPACTION_PARALLELISM` constant.
- Updated parsing logic to handle keyword arguments for `strict_window` and `regular` compaction types, including `parallelism` and `window`.
- Modified tests to reflect changes in parsing logic and default parallelism handling.
- **`request.rs`**:
- Updated `parallelism` handling in `RegionRequestBody::Compact` to use the new default value.
- **`requests.rs`**:
- Changed `CompactTableRequest` to use a non-optional `parallelism` field with a default value of 1.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat/manual-compaction-parallelism:
### Update `flush_compact_table.rs` Parameter Validation
- Modified parameter validation in `flush_compact_table.rs` to restrict the maximum number of parameters from 4 to 3 in the `parse_compact_request` function.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat/manual-compaction-parallelism:
Update `greptime-proto` dependency
- Updated the `greptime-proto` dependency to a new revision in both `Cargo.lock` and `Cargo.toml`.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat: struct value
Signed-off-by: Ning Sun <sunning@greptime.com>
* feat: update for proto module
* feat: wip struct type
* feat: implement more vector operations
* feat: make datatype and api
* feat: resolve some compilation issues
* feat: resolve all compilation issues
* chore: format update
* test: resolve tests
* test: test and refactor value-to-pb
* feat: add more tests and fix for value types
* chore: remove dbg
* feat: test and fix iterator
* fix: resolve struct_type issue
* feat: pgwire 0.33 update
* refactor: use vec for struct items
* feat: conversion from json to value
* feat: add decode function
* fix: lint issue
* feat: update how we encode raw data
* feat: add conversion to fully structured StructValue
* refactor: take owned value in all encode/decode functions
* feat: add pg serialization of structvalue
* chore: toml format
* refactor: adopt new and try_new from struct value
* chore: cleanup residual issues
* docs: docs up
* fix lint issue
* Apply suggestion from @MichaelScofield
Co-authored-by: LFC <990479+MichaelScofield@users.noreply.github.com>
* Apply suggestion from @MichaelScofield
Co-authored-by: LFC <990479+MichaelScofield@users.noreply.github.com>
* Apply suggestion from @MichaelScofield
Co-authored-by: LFC <990479+MichaelScofield@users.noreply.github.com>
* Apply suggestion from @MichaelScofield
Co-authored-by: LFC <990479+MichaelScofield@users.noreply.github.com>
* chore: address review comment especially collection capacity
* refactor: remove unneeded processed keys collection
* feat: Value::Json type
* chore: add some work in progress changes
* feat: adopt new json type
* refactor: limit scope json conversion functions
* fix: self review update
* test: provide tests for value::json
* test: add tests for api/helper
* switch proto to main branch
* fix: implement is_null for ValueRef::Json
---------
Signed-off-by: Ning Sun <sunning@greptime.com>
Co-authored-by: LFC <990479+MichaelScofield@users.noreply.github.com>
* feat: add updated_on to tablemeta with a default of created_on
Signed-off-by: Alan Tang <jmtangcs@gmail.com>
* feat: support updated_on in the alter procedure
Signed-off-by: Alan Tang <jmtangcs@gmail.com>
* feat: add updated_on into information_schema.tables
Signed-off-by: Alan Tang <jmtangcs@gmail.com>
* fix: make sqlness happy
Signed-off-by: Alan Tang <jmtangcs@gmail.com>
* test: add test case for tablemeta update
Signed-off-by: Alan Tang <jmtangcs@gmail.com>
* fix: fix failing test for ALTER TABLE
Signed-off-by: Alan Tang <jmtangcs@gmail.com>
* feat: use created_on as default for updated_on when missing
Signed-off-by: Alan Tang <jmtangcs@gmail.com>
---------
Signed-off-by: Alan Tang <jmtangcs@gmail.com>
* feat: struct value
Signed-off-by: Ning Sun <sunning@greptime.com>
* feat: update for proto module
* feat: wip struct type
* feat: implement more vector operations
* feat: make datatype and api
* feat: resolve some compilation issues
* feat: resolve all compilation issues
* chore: format update
* test: resolve tests
* test: test and refactor value-to-pb
* feat: add more tests and fix for value types
* chore: remove dbg
* feat: test and fix iterator
* fix: resolve struct_type issue
* feat: pgwire 0.33 update
* refactor: use vec for struct items
* feat: conversion from json to value
* feat: add decode function
* fix: lint issue
* feat: update how we encode raw data
* feat: add conversion to fully structured StructValue
* refactor: take owned value in all encode/decode functions
* feat: add pg serialization of structvalue
* chore: toml format
* refactor: adopt new and try_new from struct value
* chore: cleanup residual issues
* docs: docs up
* fix lint issue
* Apply suggestion from @MichaelScofield
Co-authored-by: LFC <990479+MichaelScofield@users.noreply.github.com>
* Apply suggestion from @MichaelScofield
Co-authored-by: LFC <990479+MichaelScofield@users.noreply.github.com>
* Apply suggestion from @MichaelScofield
Co-authored-by: LFC <990479+MichaelScofield@users.noreply.github.com>
* Apply suggestion from @MichaelScofield
Co-authored-by: LFC <990479+MichaelScofield@users.noreply.github.com>
* chore: address review comment especially collection capacity
* refactor: remove unneeded processed keys collection
---------
Signed-off-by: Ning Sun <sunning@greptime.com>
Co-authored-by: LFC <990479+MichaelScofield@users.noreply.github.com>
* feat: struct value
Signed-off-by: Ning Sun <sunning@greptime.com>
* feat: update for proto module
* feat: wip struct type
* feat: implement more vector operations
* feat: make datatype and api
* feat: resolve some compilation issues
* feat: resolve all compilation issues
* chore: format update
* test: resolve tests
* test: test and refactor value-to-pb
* feat: add more tests and fix for value types
* chore: remove dbg
* feat: test and fix iterator
* fix: resolve struct_type issue
* refactor: use vec for struct items
* chore: update proto to main branch
* refactor: address some of review issues
* refactor: update for further review
* Add validation on new methods
* feat: update struct/list json serialization
* refactor: reimplement get in struct_vector
* refactor: struct vector functions
* refactor: fix lint issue
* refactor: address review comments
---------
Signed-off-by: Ning Sun <sunning@greptime.com>
* feat: align influxdb line timestamp with table time index
Signed-off-by: luofucong <luofc@foxmail.com>
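The alignment above reduces to unit conversion between the line-protocol timestamp precision and the table's time index unit; a minimal sketch under that assumption (enum and function names are illustrative, not the server code):

```rust
#[derive(Clone, Copy)]
enum TimeUnit {
    Second,
    Millisecond,
    Nanosecond,
}

/// Nanoseconds per tick of each unit.
fn factor(unit: TimeUnit) -> i64 {
    match unit {
        TimeUnit::Second => 1_000_000_000,
        TimeUnit::Millisecond => 1_000_000,
        TimeUnit::Nanosecond => 1,
    }
}

/// Convert `ts` expressed in `from` into the table's `to` unit.
fn align_timestamp(ts: i64, from: TimeUnit, to: TimeUnit) -> i64 {
    ts * factor(from) / factor(to)
}

fn main() {
    // 1s in line protocol -> millisecond time index.
    assert_eq!(align_timestamp(1, TimeUnit::Second, TimeUnit::Millisecond), 1_000);
    // Nanoseconds -> millisecond index truncates sub-ms precision.
    assert_eq!(align_timestamp(1_500_000, TimeUnit::Nanosecond, TimeUnit::Millisecond), 1);
}
```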
* fix ci
Signed-off-by: luofucong <luofc@foxmail.com>
---------
Signed-off-by: luofucong <luofc@foxmail.com>
* tests: fix unit test by passing one sort column
Signed-off-by: discord9 <discord9@163.com>
* chore: per copilot
Signed-off-by: discord9 <discord9@163.com>
---------
Signed-off-by: discord9 <discord9@163.com>
* refactor: cleanup datafusion-pg-catalog dependencies
Signed-off-by: Ning Sun <sunning@greptime.com>
* chore: toml format
* feat: update upstream
---------
Signed-off-by: Ning Sun <sunning@greptime.com>
* fix: not applied
Signed-off-by: discord9 <discord9@163.com>
* chore: per review
Signed-off-by: discord9 <discord9@163.com>
* test: confirm ORDER BY is not pushed down
Signed-off-by: discord9 <discord9@163.com>
---------
Signed-off-by: discord9 <discord9@163.com>
* feat: use correct projection index for old format
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: remove allow dead_code from format
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: check and convert old format to flat format
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: sub primary key num from projection
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: always convert the batch in FlatRowGroupReader
Signed-off-by: evenyag <realevenyag@gmail.com>
* style: fix clippy
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: Change &Option<&[]> to Option<&[]>
Signed-off-by: evenyag <realevenyag@gmail.com>
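This is the common `&Option<&[T]>` to `Option<&[T]>` signature cleanup; a small example of why the flattened form is nicer for callers (the function is hypothetical):

```rust
// Taking Option<&[usize]> lets callers with an owned Option<Vec<usize>>
// pass it via as_deref(), and the callee matches one level of Option.
fn sum_projected(projection: Option<&[usize]>) -> usize {
    projection.map(|p| p.iter().sum()).unwrap_or(0)
}

fn main() {
    let owned: Option<Vec<usize>> = Some(vec![1, 2, 3]);
    // as_deref(): Option<Vec<usize>> -> Option<&[usize]>
    assert_eq!(sum_projected(owned.as_deref()), 6);
    assert_eq!(sum_projected(None), 0);
}
```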
* refactor: only build arrow schema once
adds a method flat_sst_arrow_schema_column_num() to get the field count
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: Handle flat format and old format separately
Adds two structs ParquetFlat and ParquetPrimaryKeyToFlat.
ParquetPrimaryKeyToFlat delegates stats and projection to the
PrimaryKeyReadFormat.
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: handle non string tag correctly
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: do not register file cache twice
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: clean temp files
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: add rows and bytes to flush success log
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: convert format in memtable
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: add compaction flag to ScanInput
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: compaction should use old format for sparse encoding
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: merge schema uses the old format in sparse encoding
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: read legacy format but do not convert if skip_auto_convert is set
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: support sparse encoding in bulk parts
Signed-off-by: evenyag <realevenyag@gmail.com>
---------
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: add datafusion-postgres dependency
* refactor: move and include pg_catalog udfs
* chore: update upstream
* feat: register table function pg_get_keywords
* feat: bridge CatalogInfo for our CatalogManager
Signed-off-by: Ning Sun <sunning@greptime.com>
* feat: convert pg_catalog table to our system table
* feat: bridge system catalog with datafusion-postgres
Signed-off-by: Ning Sun <sunning@greptime.com>
* feat: add more udfs
* feat: add compatibility rewriter to postgres handler
* fix: various fixes
* fmt: fix
* fix: use functions from pg_catalog library
* fmt
* fix: sqlness runner
Signed-off-by: Ning Sun <sunning@greptime.com>
* test: adopt arrow 56.0 to 56.1 memory size change
* fix: add additional udfs
* chore: format
* refactor: return None when creating system table failed
Signed-off-by: Ning Sun <sunning@greptime.com>
* chore: provide safety comments about expect usage
---------
Signed-off-by: Ning Sun <sunning@greptime.com>
chore/unset-tz-env-in-test:
Add environment variable cleanup in timezone tests
- Updated `timezone.rs` to include removal of the `TZ` environment variable in the `test_from_tz_string` function to ensure a clean test environment.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
fix/disable-parquet-stats-truncate:
- **Update `memcomparable` Dependency**: Switched from crates.io to a Git repository for `memcomparable` in `Cargo.lock`, `mito-codec/Cargo.toml`, and removed it from `mito2/Cargo.toml`.
- **Enhance Parquet Writer Properties**: Added `set_statistics_truncate_length` and `set_column_index_truncate_length` to `WriterProperties` in `parquet.rs`, `bulk/part.rs`, `partition_tree/data.rs`, and `writer.rs`.
- **Add Test for Corrupt Scan**: Introduced a new test module `scan_corrupt.rs` in `mito2/src/engine` to verify handling of corrupt data.
- **Update Test Data**: Modified test data in `flush.rs` to reflect changes in file sizes and sequences.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* chore/update-sequence-on-region-edit:
Refactor `get_last_seq_num` Method Across Engines
- **Change Return Type**: Updated the `get_last_seq_num` method to return `Result<SequenceNumber, BoxedError>` instead of `Result<Option<SequenceNumber>, BoxedError>` in the following files:
- `src/datanode/src/tests.rs`
- `src/file-engine/src/engine.rs`
- `src/metric-engine/src/engine.rs`
- `src/metric-engine/src/engine/read.rs`
- `src/mito2/src/engine.rs`
- `src/query/src/optimizer/test_util.rs`
- `src/store-api/src/region_engine.rs`
- **Enhance Region Edit Handling**: Modified `RegionWorkerLoop` in `src/mito2/src/worker/handle_manifest.rs` to update file sequences during region edits.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* add committed_sequence to RegionEdit
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* chore/update-sequence-on-region-edit:
Refactor sequence retrieval method
- **Renamed Method**: Changed `get_last_seq_num` to `get_committed_sequence` across multiple files to better reflect its purpose of retrieving the latest committed sequence.
- Affected files: `tests.rs`, `engine.rs` in `file-engine`, `metric-engine`, `mito2`, `test_util.rs`, and `region_engine.rs`.
- **Removed Unused Struct**: Deleted `RegionSequencesRequest` struct from `region_request.rs` as it is no longer needed.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* chore/update-sequence-on-region-edit:
**Add Committed Sequence Handling in Region Engine**
- **`engine.rs`**: Introduced a new test module `bump_committed_sequence_test` to verify committed sequence handling.
- **`bump_committed_sequence_test.rs`**: Added a test to ensure the committed sequence is correctly updated and persisted across region reopenings.
- **`action.rs`**: Updated `RegionManifest` and `RegionManifestBuilder` to include `committed_sequence` for tracking.
- **`manager.rs`**: Adjusted manifest size assertion to accommodate new committed sequence data.
- **`opener.rs`**: Implemented logic to override committed sequence during region opening.
- **`version.rs`**: Added `set_committed_sequence` method to update the committed sequence in `VersionControl`.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* chore/update-sequence-on-region-edit:
**Enhance `test_bump_committed_sequence` in `bump_committed_sequence_test.rs`**
- Updated the test to include row operations using `build_rows`, `put_rows`, and `rows_schema` to verify the committed sequence behavior.
- Adjusted assertions to reflect changes in committed sequence after row operations and region edits.
- Added comments to clarify the expected behavior of committed sequence after reopening the region and replaying the WAL.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* chore/update-sequence-on-region-edit:
**Enhance Region Sequence Management**
- **`bump_committed_sequence_test.rs`**: Updated test to handle region reopening and sequence management, ensuring committed sequences are correctly set and verified after edits.
- **`opener.rs`**: Improved committed sequence handling by overriding it only if the manifest's sequence is greater than the replayed sequence. Added logging for mutation sequence replay.
- **`region_write_ctx.rs`**: Modified `push_mutation` and `push_bulk` methods to adopt sequence numbers from parameters, enhancing sequence management during write operations.
- **`handle_write.rs`**: Updated `RegionWorkerLoop` to pass sequence numbers in `push_bulk` and `push_mutation` methods, ensuring consistent sequence handling.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* chore/update-sequence-on-region-edit:
### Remove Debug Logging from `opener.rs`
- Removed debug logging for mutation sequences in `opener.rs` to clean up the output and improve performance.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* test: migrate join tests
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* chore: update test results after rebasing main branch
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix: unstable query sort results and natural_join test
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix: count(*) with joining
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix: unstable query sort results and style
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
---------
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
The allocated and resident metrics were swapped in the set calls. This commit
fixes the issue by ensuring each metric receives its corresponding value, as sketched below.
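A minimal reconstruction of the bug and its fix; the gauge plumbing is illustrative, not the actual metrics code:

```rust
struct Gauge(i64);
impl Gauge {
    fn set(&mut self, v: i64) {
        self.0 = v;
    }
}

fn main() {
    let (allocated, resident) = (1_000, 4_096); // values read from jemalloc stats
    let mut allocated_gauge = Gauge(0);
    let mut resident_gauge = Gauge(0);

    // Before the fix the arguments were crossed:
    //   allocated_gauge.set(resident);
    //   resident_gauge.set(allocated);
    allocated_gauge.set(allocated);
    resident_gauge.set(resident);

    assert_eq!((allocated_gauge.0, resident_gauge.0), (1_000, 4_096));
}
```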
* feat: support flat format in SeqScan
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: support flat format in unordered scan
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: support parallel read for flat format in SeqScan
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: rename flat DedupReader to FlatDedupReader
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: address review comments
It also precomputes the input arrow schema
Signed-off-by: evenyag <realevenyag@gmail.com>
---------
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: add FlushRegionsV2 instruction with unified semantics
- Add FlushRegionsV2 struct supporting both single and batch operations
- Preserve original FlushRegion(RegionId) API for backward compatibility
- Support configurable FlushStrategy (Sync/Async) and FlushErrorStrategy (FailFast/TryAll)
- Add detailed per-region error reporting in FlushRegionReply
- Update datanode handlers to support both legacy and enhanced flush instructions
- Maintain zero breaking changes through automatic conversion of legacy formats
Signed-off-by: Alex Araujo <alexaraujo@gmail.com>
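A hedged sketch of the instruction shape inferred from the bullets above; the exact field layout in the real proto may differ:

```rust
type RegionId = u64;

#[derive(Debug, Clone, Copy)]
enum FlushStrategy {
    Sync,  // wait for each flush to complete
    Async, // fire-and-forget
}

#[derive(Debug, Clone, Copy)]
enum FlushErrorStrategy {
    FailFast, // stop at the first failed region
    TryAll,   // attempt every region and collect all errors
}

#[derive(Debug)]
struct FlushRegionsV2 {
    region_ids: Vec<RegionId>, // a single entry covers the legacy case
    strategy: FlushStrategy,
    error_strategy: FlushErrorStrategy,
}

#[derive(Debug, Default)]
struct FlushRegionReply {
    // detailed per-region error reporting: (region, error message)
    errors: Vec<(RegionId, String)>,
}

// Legacy FlushRegion(RegionId) converts losslessly into the new form,
// which is how zero breaking changes are maintained.
impl From<RegionId> for FlushRegionsV2 {
    fn from(region_id: RegionId) -> Self {
        FlushRegionsV2 {
            region_ids: vec![region_id],
            strategy: FlushStrategy::Sync,
            error_strategy: FlushErrorStrategy::FailFast,
        }
    }
}

fn main() {
    let legacy: FlushRegionsV2 = 42u64.into();
    assert_eq!(legacy.region_ids, vec![42]);
    let reply = FlushRegionReply::default();
    assert!(reply.errors.is_empty());
}
```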
* chore: run make fmt
Signed-off-by: Alex Araujo <alexaraujo@gmail.com>
* Apply suggestions from code review
Co-authored-by: LFC <990479+MichaelScofield@users.noreply.github.com>
Signed-off-by: Alex Araujo <alexaraujo@gmail.com>
* refactor: extract shared perform_region_flush fn
Signed-off-by: Alex Araujo <alexaraujo@gmail.com>
* refactor: use consistent error type across similar methods
see gh copilot suggestion: https://github.com/GreptimeTeam/greptimedb/pull/6819#discussion_r2299603698
Signed-off-by: Alex Araujo <alexaraujo@gmail.com>
* chore: make fmt
Signed-off-by: Alex Araujo <alexaraujo@gmail.com>
* refactor: consolidate FlushRegion instructions
Signed-off-by: Alex Araujo <alexaraujo@gmail.com>
---------
Signed-off-by: Alex Araujo <alexaraujo@gmail.com>
* feat: implements method to write flat batch for ParquetWriter
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: add update method for flat RecordBatch in Indexer
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: calls indexer to write flat batch in ParquetWriter
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: handle empty projection for flat format
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: eval array in precise_filter_flat
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: cache column lookup result in inverted indexer
Signed-off-by: evenyag <realevenyag@gmail.com>
* test: add test
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: support dict type in dense codec
Signed-off-by: evenyag <realevenyag@gmail.com>
* test: remove read part in test as it requires modifying the reader
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: support dictionary type in other methods for dense codec
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: fulltext use string array directly
Signed-off-by: evenyag <realevenyag@gmail.com>
---------
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore/change-encode-raw-values-sig:
### Update Sparse Encoding to Use Byte Slices
- **`bench_sparse_encoding.rs`**: Modified the `encode_raw_tag_value` function to use byte slices instead of `Bytes` for tag values.
- **`sparse.rs`**: Updated the `encode_raw_tag_value` method in `SparsePrimaryKeyCodec` to accept byte slices (`&[u8]`) instead of `Bytes`. Adjusted related test cases to reflect this change.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* chore/change-encode-raw-values-sig:
### Add `Clear` Trait Implementation for Byte Slices
- Implemented the `Clear` trait for byte slices (`&[u8]`) in `repeated_field.rs` to enhance trait coverage and provide a default clear operation for byte slice types.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat: update rate limiter to use a semaphore that blocks instead of returning an error
Signed-off-by: Ning Sun <sunning@greptime.com>
* fix: remove unused error
Signed-off-by: Ning Sun <sunning@greptime.com>
---------
Signed-off-by: Ning Sun <sunning@greptime.com>
* feat: support different key type for the dictionary vector
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: support more dictionary type in try_into_vector
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: use key array's type as key type
Signed-off-by: evenyag <realevenyag@gmail.com>
---------
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: cache the cloned page bytes
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: cache the whole row group pages
The opendal reader may merge IO requests, so the pages of different
columns can share the same Bytes.
When we use a per-column page cache, the cache may still reference
the whole Bytes after eviction if other columns in the cache
share the same Bytes.
Signed-off-by: evenyag <realevenyag@gmail.com>
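The sharing behavior is easy to demonstrate with the bytes crate, which this example assumes; it shows why the cache stores copied pages instead of slices of the merged IO buffer:

```rust
use bytes::Bytes;

fn main() {
    // A large buffer produced by a merged IO request.
    let merged_io: Bytes = Bytes::from(vec![0u8; 4 * 1024 * 1024]);

    // Shares the 4 MiB allocation: cheap, but pins all of it while cached.
    let shared_page = merged_io.slice(0..1024);

    // Copies 1 KiB into a fresh allocation: what the cache stores instead.
    let copied_page = Bytes::copy_from_slice(&merged_io[0..1024]);

    drop(merged_io);
    // Only `copied_page` lets the 4 MiB buffer actually be freed here;
    // `shared_page` still keeps the whole allocation alive.
    assert_eq!(shared_page.len(), copied_page.len());
}
```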
* feat: check possible max byte range and copy pages if needed
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: always copy pages
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: returns the copied pages
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: compute cache size by MERGE_GAP
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: align to buf size
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: align to 2MB
Signed-off-by: evenyag <realevenyag@gmail.com>
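The alignment presumably rounds each entry's charged size up to a fixed 2 MiB unit so cached pages account for the merged IO buffers they may pin; a small sketch with an illustrative constant name:

```rust
const ALIGN: u64 = 2 * 1024 * 1024; // 2 MiB

/// Round a byte count up to the next multiple of ALIGN.
fn aligned_cache_size(bytes: u64) -> u64 {
    bytes.div_ceil(ALIGN) * ALIGN
}

fn main() {
    assert_eq!(aligned_cache_size(1), ALIGN);
    assert_eq!(aligned_cache_size(ALIGN), ALIGN);
    assert_eq!(aligned_cache_size(ALIGN + 1), 2 * ALIGN);
}
```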
* chore: remove unused code
Signed-off-by: evenyag <realevenyag@gmail.com>
* style: fix clippy
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: fix typo
Signed-off-by: evenyag <realevenyag@gmail.com>
* test: fix parquet read with cache test
Signed-off-by: evenyag <realevenyag@gmail.com>
---------
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: Add support for TWCS time window hints in insert operations
Signed-off-by: WenyXu <wenymedia@gmail.com>
* feat: set system events table time window to 1d
Signed-off-by: WenyXu <wenymedia@gmail.com>
---------
Signed-off-by: WenyXu <wenymedia@gmail.com>
* perf/sparse-encoder:
- **Update Dependencies**: Updated `criterion-plot` to version `0.5.0` and added `criterion` version `0.7.0` in `Cargo.lock`. Added `bytes` to `Cargo.toml` in `src/metric-engine`.
- **Benchmarking**: Added a new benchmark for sparse encoding in `bench_sparse_encoding.rs` and updated `Cargo.toml` in `src/mito-codec` to include `criterion` as a dev-dependency.
- **Sparse Encoding Enhancements**: Modified `SparsePrimaryKeyCodec` in `sparse.rs` to include new methods `encode_raw_tag_value` and `encode_internal`. Added public constants `RESERVED_COLUMN_ID_TSID` and `RESERVED_COLUMN_ID_TABLE_ID`.
- **HTTP Server**: Made `try_decompress` function public in `prom_store.rs`.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* perf/sparse-encoder:
Improve buffer handling in `sparse.rs`
- Refactored buffer reservation logic to use `value_len` for clarity.
- Optimized chunk processing by calculating `num_chunks` and `remainder` for efficient data handling.
- Enhanced manual serialization of bytes to avoid byte-by-byte operations, improving performance.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* Update src/mito-codec/src/row_converter/sparse.rs
Co-authored-by: Yingwen <realevenyag@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
Co-authored-by: Yingwen <realevenyag@gmail.com>
* test: add failing test for #6791
* test: add support for = and =~
* fix: lint
* fix: code merge issue
Signed-off-by: Ning Sun <sunning@greptime.com>
---------
Signed-off-by: Ning Sun <sunning@greptime.com>
* chore: improve error message when there is more than one time index
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* chore: style
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
---------
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* feat: optimize CreateFlowData with lightweight FlowQueryContext
Replace the full QueryContext with a lightweight FlowQueryContext containing only the essential fields (catalog, schema, timezone) in the CreateFlowData struct. This improves serialization performance by eliminating the unused extensions HashMap and channel fields.
Key changes:
- Add FlowQueryContext struct with conversion implementations
- Update CreateFlowData to use FlowQueryContext with backward compatibility
- Add tests for serialization and conversions
Signed-off-by: Alex Araujo <alexaraujo@gmail.com>
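A hedged sketch of the lightweight context and its conversions, with field names taken from the commit message; the real QueryContext carries more state than shown here:

```rust
#[derive(Clone, Debug, PartialEq)]
struct FlowQueryContext {
    catalog: String,
    schema: String,
    timezone: String,
}

#[derive(Clone, Debug, Default)]
struct QueryContext {
    catalog: String,
    schema: String,
    timezone: String,
    // Extensions map, channel, etc. omitted: they are exactly the parts
    // FlowQueryContext drops to stay cheap to serialize.
}

impl From<&QueryContext> for FlowQueryContext {
    fn from(ctx: &QueryContext) -> Self {
        FlowQueryContext {
            catalog: ctx.catalog.clone(),
            schema: ctx.schema.clone(),
            timezone: ctx.timezone.clone(),
        }
    }
}

impl From<FlowQueryContext> for QueryContext {
    fn from(f: FlowQueryContext) -> Self {
        QueryContext {
            catalog: f.catalog,
            schema: f.schema,
            timezone: f.timezone,
        }
    }
}

fn main() {
    let full = QueryContext {
        catalog: "greptime".into(),
        schema: "public".into(),
        timezone: "UTC".into(),
    };
    let lite = FlowQueryContext::from(&full);
    // Backward compatibility: the lightweight form converts back losslessly.
    let round_trip: QueryContext = lite.clone().into();
    assert_eq!(lite, FlowQueryContext::from(&round_trip));
}
```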
* chore: run make fmt
Signed-off-by: Alex Araujo <alexaraujo@gmail.com>
---------
Signed-off-by: Alex Araujo <alexaraujo@gmail.com>
* refactor: use DataFusion's UDAF implementation directly
Signed-off-by: luofucong <luofc@foxmail.com>
* remove: delete how-to guide for writing aggregate functions
Signed-off-by: luofucong <luofc@foxmail.com>
* fix ci
Signed-off-by: luofucong <luofc@foxmail.com>
* refactor: port json_encode_path to datafusion udaf
Signed-off-by: Ning Sun <sunning@greptime.com>
---------
Signed-off-by: luofucong <luofc@foxmail.com>
Signed-off-by: Ning Sun <sunning@greptime.com>
Co-authored-by: Ning Sun <sunning@greptime.com>
* feat: Implements FlatLastNonNull strategy
Dedup rows and keep last non null fields
Signed-off-by: evenyag <realevenyag@gmail.com>
* test: add basic test for FlatLastNonNull
Signed-off-by: evenyag <realevenyag@gmail.com>
* test: port more last non null test
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: do merge rows after delete op
Signed-off-by: evenyag <realevenyag@gmail.com>
* style: fix clippy
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: rename num_pk_columns to field_column_start
So we can support different format later
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: address comment
Signed-off-by: evenyag <realevenyag@gmail.com>
---------
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: add u64 for EqualValue and set expr to true when the filter is empty
* Update src/log-query/src/log_query.rs
Co-authored-by: Yingwen <realevenyag@gmail.com>
* chore: update EqualValue Uinit to UInt
---------
Co-authored-by: Yingwen <realevenyag@gmail.com>
chore/impl-cast-to-primitives-for-path-type:
### Add `num_enum` for Enum Conversion and Update `PathType`
- **Added `num_enum` Dependency**: Updated `Cargo.lock` and `Cargo.toml` to include `num_enum` for enum conversion functionality.
- Files: `Cargo.lock`, `src/store-api/Cargo.toml`
- **Enhanced `PathType` Enum**: Implemented `TryFromPrimitive` for `PathType` to enable conversion from primitive types.
- Files: `src/store-api/src/region_request.rs`
- **Added Unit Tests**: Introduced tests to verify the conversion of `PathType` enum to and from primitive types.
- Files: `src/store-api/src/region_request.rs`
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* test: reproduce windows ci issue
* chore: update sqlx
* chore: update pgwire
* chore: update to a debug version of pgwire
* fix: update pgwire to resolve peek after read on windows
* ci: remove windows task from regular ci
refactor: extract the common code for creating proto ColumnSchema and Row into helper functions
fix: explicitly set the follower max sequence when finding extension ranges to avoid a potential concurrency hazard
Signed-off-by: luofucong <luofc@foxmail.com>
* feat: update pgwire api
* feat: update pgwire and override on_query/on_execute
* feat: update pgwire to 0.32
* chore: remove code example
Signed-off-by: Ning Sun <sunning@greptime.com>
---------
Signed-off-by: Ning Sun <sunning@greptime.com>
* perf: cached readers do not get pages concurrently
Otherwise they will all fetch the same pages in parallel
Signed-off-by: evenyag <realevenyag@gmail.com>
* perf: always disable zstd for bloom
Signed-off-by: evenyag <realevenyag@gmail.com>
---------
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore/optimize-catalog:
### Add `table_id` Method to `CatalogManager`
- **Files Modified**:
- `src/catalog/src/kvbackend/manager.rs`
- `src/catalog/src/lib.rs`
- **Key Changes**:
- Introduced a new asynchronous method `table_id` in the `CatalogManager` trait to retrieve the table ID based on catalog, schema, and table name.
- Implemented the `table_id` method in `KvBackendCatalogManager` to fetch the table ID from the system catalog or cache, with a fallback to `pg_catalog` for Postgres channels.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* chore/optimize-catalog:
### Add `table_info_by_id` Method to Catalog Managers
- **`manager.rs`**: Introduced the `table_info_by_id` method in `KvBackendCatalogManager` to retrieve table information by table ID using the `TableInfoCacheRef`.
- **`lib.rs`**: Updated the `CatalogManager` trait to include the new `table_info_by_id` method.
- **`memory/manager.rs`**: Implemented the `table_info_by_id` method in `MemoryCatalogManager` to fetch table information by table ID from in-memory catalogs.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* fix: do not mark all deleted when partial truncate & update manifest when partial file range is empty
Signed-off-by: discord9 <discord9@163.com>
* docs: note
Signed-off-by: discord9 <discord9@163.com>
---------
Signed-off-by: discord9 <discord9@163.com>
* feat: initial support for __schema__ in label values
* feat: filter database with matches
* refactor: skip unnecessary check
* fix: resolve schema matcher in label values
* test: add a test case for table not exists
* refactor: add matchop check on db label
* chore: merge main
fix/compaction-concurrency:
Add delay before compaction in `compaction_test.rs`
- Introduced a 2-millisecond delay using `tokio::time::sleep` before the `compact` function call in `test_compaction_region_with_overlapping_delete_all` to ensure proper timing and synchronization during the test execution.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat: add `SET DEFAULT` syntax
Signed-off-by: Yihai Lin <yihai-lin@foxmail.com>
* test: add `CURRENT_TIMESTAMP()` as default value for `SET DEFAULT` syntax
Signed-off-by: Yihai Lin <yihai-lin@foxmail.com>
* refactor: Make the error types more precise.
Signed-off-by: Yihai Lin <yihai-lin@foxmail.com>
* chore: a minor error display enhancement for `SET DEFAULT`
Signed-off-by: Yihai Lin <yihai-lin@foxmail.com>
* refactor: Use `MODIFY COLUMN` for `DROP/SET DEFAULT`
Signed-off-by: Yihai Lin <yihai-lin@foxmail.com>
* chore: update `greptime-proto`
Signed-off-by: Yihai Lin <yihai-lin@foxmail.com>
---------
Signed-off-by: Yihai Lin <yihai-lin@foxmail.com>
* feat: supports more db options
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix: tests
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix: use btree map for consistent results
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
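For context, the consistency comes from BTreeMap's ordered iteration; a tiny example with illustrative option keys:

```rust
use std::collections::BTreeMap;

fn main() {
    // BTreeMap iterates in key order, so rendered db options are stable
    // across runs, unlike HashMap whose order depends on hashing state.
    let mut opts = BTreeMap::new();
    opts.insert("ttl", "7d");
    opts.insert("compaction.type", "twcs");
    let rendered: Vec<String> = opts.iter().map(|(k, v)| format!("{k}={v}")).collect();
    assert_eq!(rendered, vec!["compaction.type=twcs", "ttl=7d"]);
}
```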
* feat: adds compaction keys into valid db options
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
---------
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* bulk-multiparts-merge-reader:
**Enhance Memtable Iteration and Flushing Logic**
- **`flush.rs`**: Updated `RegionFlushTask` to handle multiple ranges using `MergeReaderBuilder` for improved source management during flush operations.
- **`memtable.rs`**: Introduced `build_prune_iter` and `build_iter` methods in `MemtableRange` for flexible iteration. Added `MemtableRanges` struct to manage multiple contexts.
- **`simple_bulk_memtable.rs`**: Refactored to use `BatchIterBuilder` and `BatchIterBuilderDeprecated` for iteration, supporting new `read_to_values` method in `Series`.
- **`time_series.rs`**: Added `read_to_values` and `finish_cloned` methods in `Series` and `ValueBuilder` for efficient data handling.
- **`scan_util.rs`**: Replaced `build_iter` with `build_prune_iter` for range iteration, enhancing scan utility.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* bulk-multiparts-merge-reader:
- **Add Rayon for Parallel Processing**: Introduced `rayon` for parallel processing in `simple_bulk_memtable.rs` and updated `Cargo.toml` and `Cargo.lock` to include `rayon` dependency.
- **Enhance Benchmarking**: Added new benchmarks in `simple_bulk_memtable.rs` to compare parallel vs sequential processing, projection, sequence filtering, and write performance.
- **Make Structs and Methods Public**: Changed visibility of several structs and methods to `pub` in `simple_bulk_memtable.rs`, `memtable.rs`, `time_series.rs`, and `test_util.rs` to facilitate testing and benchmarking.
- **Update Criterion Features**: Modified `Cargo.toml` to include `html_reports` feature for `criterion`.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* bulk-multiparts-merge-reader:
- **Refactor `SimpleBulkMemtable`**:
- Moved `ranges_sequential` function to a new `test_only` module and made it a method of `SimpleBulkMemtable`.
- Made several fields in `SimpleBulkMemtable` private and added a `region_metadata` getter.
- Affected files: `simple_bulk_memtable.rs`, `test_only.rs`.
- **Benchmark Adjustments**:
- Updated benchmark functions to use the new `ranges_sequential` method.
- Affected file: `simple_bulk_memtable.rs`.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* bulk-multiparts-merge-reader:
### Add Test Configuration for `iter` Method in Memtable Implementations
- **Enhancements**:
- Added `#[cfg(any(test, feature = "test"))]` attribute to the `iter` method in various `Memtable` implementations to enable conditional compilation for testing purposes.
- Affected files:
- `src/mito2/src/memtable.rs`
- `src/mito2/src/memtable/bulk.rs`
- `src/mito2/src/memtable/partition_tree.rs`
- `src/mito2/src/memtable/simple_bulk_memtable.rs`
- `src/mito2/src/memtable/time_series.rs`
- `src/mito2/src/test_util/memtable_util.rs`
- **Benchmark Adjustments**:
- Removed `black_box` usage in `bench_memtable_write_performance` function to streamline benchmarking.
- Affected file: `src/mito2/benches/simple_bulk_memtable.rs`
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* bulk-multiparts-merge-reader:
**Enhance Async Support and Refactor Iteration in `mito2`**
- **Add Async Features**: Updated `Cargo.toml` to include `async` and `async_tokio` features for `criterion`.
- **Async Iteration**: Introduced async functions `flush` and `flush_original` in `simple_bulk_memtable.rs` to handle memtable flushing using async iterators.
- **Refactor Iteration Logic**: Moved `create_iter` and `BatchIterBuilderDeprecated` to `test_only.rs` for better separation of concerns.
- **Public API Change**: Made `next_batch` in `read.rs` public to support async batch processing.
- **Benchmark Updates**: Modified benchmarks in `simple_bulk_memtable.rs` to use async runtime for performance testing.
Files affected: `Cargo.toml`, `simple_bulk_memtable.rs`, `test_only.rs`, `read.rs`.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
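A hedged sketch of what a criterion async benchmark over a flush-like future looks like; the crate versions and the `flush` workload are assumptions, not the mito2 code:

```rust
// Cargo.toml (assumed):
//   criterion = { version = "0.5", features = ["async_tokio", "html_reports"] }
//   tokio = { version = "1", features = ["rt-multi-thread"] }
use criterion::{criterion_group, criterion_main, Criterion};
use tokio::runtime::Runtime;

// Stand-in for an async flush over a memtable iterator.
async fn flush(rows: &[u64]) -> u64 {
    rows.iter().copied().sum()
}

fn bench_flush(c: &mut Criterion) {
    let rt = Runtime::new().unwrap();
    let rows: Vec<u64> = (0..100_000).collect();
    c.bench_function("flush", |b| {
        // `to_async` drives the future on the provided runtime each iteration.
        b.to_async(&rt).iter(|| flush(&rows));
    });
}

criterion_group!(benches, bench_flush);
criterion_main!(benches);
```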
* bulk-multiparts-merge-reader:
**Enhance Benchmarking for Memtable**
- Refactored `create_large_memtable` to `create_memtable_with_rows` in `simple_bulk_memtable.rs` to allow dynamic row count configuration.
- Introduced parameterized benchmarking in `bench_ranges_parallel_vs_sequential` to test various row counts, improving the flexibility and coverage of performance tests.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* bulk-multiparts-merge-reader:
### Enhance Memory Management and Public API
- **`builder.rs`**: Made `next_offset` method public to allow external access to offset calculations.
- **`simple_bulk_memtable.rs`**: Simplified the `series.extend` method by removing the iterator conversion for `fields`.
- **`time_series.rs`**:
- Added `can_accommodate` method to `ValueBuilder` to check if fields can be accommodated without offset overflow.
- Modified `extend` method to use a `Vec` for `fields` instead of an iterator, improving memory management and error handling.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
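The overflow check can be illustrated standalone: Arrow-style variable-length builders track `i32` byte offsets, so an accommodation check must refuse fields that would push the running offset past `i32::MAX`. A minimal sketch (the real signature in `time_series.rs` may differ):

```rust
/// Returns whether appending `fields` keeps the running byte offset
/// within the i32 offset space used by variable-length builders.
fn can_accommodate(current_offset: i32, fields: &[&[u8]]) -> bool {
    // Accumulate in i64 so the check itself cannot overflow.
    let mut offset = current_offset as i64;
    for f in fields {
        offset += f.len() as i64;
        if offset > i32::MAX as i64 {
            return false;
        }
    }
    true
}

fn main() {
    assert!(can_accommodate(0, &[b"hello", b"world"]));
    // A field that would overflow the i32 offset space is rejected.
    let near_limit = i32::MAX - 4;
    assert!(!can_accommodate(near_limit, &[b"hello"]));
}
```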
* bulk-multiparts-merge-reader:
Add License and Enhance Testing in `simple_bulk_memtable.rs`
- Added Apache License header to `simple_bulk_memtable.rs`.
- Modified test configuration in `simple_bulk_memtable.rs` to include `any(test, feature = "test")`.
- Introduced a new test `test_write_read_large_string` in `simple_bulk_memtable.rs` to verify handling of large strings.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* bulk-multiparts-merge-reader:
Update `Cargo.toml` dependencies
- Adjust features for `common-meta` and `mito-codec` to include "testing".
- Maintain `criterion` version and features for async support.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* bulk-multiparts-merge-reader:
### Update Predicate Type in Memtable Iterators
- **Files Modified**:
- `src/mito2/src/memtable.rs`
- `src/mito2/src/memtable/bulk.rs`
- `src/mito2/src/memtable/simple_bulk_memtable.rs`
- **Key Changes**:
- Updated the `iter` method in `Memtable` trait and its implementations to use `Option<table::predicate::Predicate>` instead of `Option<Predicate>`.
- Adjusted return type in `BulkMemtable`'s `iter` method to `Result<crate::memtable::BoxedBatchIterator>`.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* bulk-multiparts-merge-reader:
**Enhance Memtable Functionality**
- **`memtable.rs`**:
- Added `Clone` trait to `MemtableStats` and made `num_ranges` public.
- Introduced `num_rows` field in `MemtableRange` and updated its constructor.
- Added `num_rows` method to `MemtableRange`.
- **`partition_tree.rs`, `simple_bulk_memtable.rs`, `time_series.rs`**:
- Updated `MemtableRange` instantiation to include `num_rows`.
- **`range.rs`**:
- Refactored `MemRangeBuilder` to handle a single `MemtableRange` and `MemtableStats`.
- **`scan_region.rs`**:
- Enhanced memtable filtering based on time range and updated `MemRangeBuilder` usage.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* bulk-multiparts-merge-reader:
**Enhancements and Bug Fixes**
- **Deduplication Enhancements**:
- Introduced `DedupReader` and `LastRow` as public structs in `dedup.rs` to enhance deduplication capabilities.
- Added `LastNonNull` deduplication strategy in `flush.rs` and `simple_bulk_memtable.rs`.
- **Memtable Improvements**:
- Updated `SimpleBulkMemtable` to support batch size configuration and deduplication strategies.
- Modified `Series` struct in `time_series.rs` to include a configurable capacity.
- **Testing Enhancements**:
- Added new test `test_write_dedup` in `simple_bulk_memtable.rs` to verify deduplication functionality.
- Updated existing tests to include `OpType` parameter for better operation type handling.
- **Refactoring**:
- Renamed `BatchIterBuilder` to `BatchRangeBuilder` in `simple_bulk_memtable.rs` for clarity.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* bulk-multiparts-merge-reader:
- **Refactor `flush.rs`:** Removed `LastNonNullIter` usage and adjusted `DedupReader` instantiation to use `LastRow::new(false)` and `LastNonNull::new(false)`.
- **Enhance `simple_bulk_memtable.rs`:** Added logic to handle `LastNonNull` merge mode in `IterBuilder`. Introduced new tests: `test_delete_only` and `test_single_range` to verify delete operations and single range handling.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
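A minimal sketch of the two strategies' semantics over rows sorted by key and descending sequence; the `Row` type and both functions are simplified stand-ins for `DedupReader` with `LastRow`/`LastNonNull`:

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Row {
    key: u64,
    seq: u64,
    field: Option<i64>,
}

/// `LastRow`: the newest row per key wins outright.
fn last_row(sorted: &[Row]) -> Vec<Row> {
    let mut out: Vec<Row> = Vec::new();
    for r in sorted {
        if out.last().map(|p| p.key) != Some(r.key) {
            out.push(*r); // first (newest) row per key
        }
    }
    out
}

/// `LastNonNull`: additionally back-fill null fields from older duplicates.
fn last_non_null(sorted: &[Row]) -> Vec<Row> {
    let mut out: Vec<Row> = Vec::new();
    for r in sorted {
        if let Some(p) = out.last_mut() {
            if p.key == r.key {
                // Older duplicate: fill the field only if still null.
                if p.field.is_none() {
                    p.field = r.field;
                }
                continue;
            }
        }
        out.push(*r);
    }
    out
}

fn main() {
    let rows = [
        Row { key: 1, seq: 9, field: None },
        Row { key: 1, seq: 5, field: Some(42) },
        Row { key: 2, seq: 7, field: Some(7) },
    ];
    assert_eq!(last_row(&rows)[0].field, None);
    assert_eq!(last_non_null(&rows)[0].field, Some(42));
}
```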
* fix: tests
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* refactor: remove stale manifest structures
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
* add RegionId to FileId
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
* rename method
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
* fix test cases
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
* fix test
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
* refactor: introduce RegionFileId
- FileId still only consists of a UUID
- PathProvider accepts RegionFileId and doesn't need to keep a region id
in it
- All Index applier takes RegionFileId and respects the region id in the RegionFileId
- FileMeta can still derive Serialize/Deserialize
- Refactor the CacheManager to accept RegionFileId
Signed-off-by: evenyag <realevenyag@gmail.com>
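A compact sketch of the shape of this refactor, with simplified stand-ins (the real `FileId` wraps a UUID, and the path layout below is illustrative, not the actual on-disk scheme):

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct RegionId(u64);

/// Stand-in for a UUID-backed file id; it carries no region information.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct FileId(u128);

/// Pairs a file with its owning region, so path providers and index
/// appliers no longer need to keep a separate region id.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct RegionFileId {
    region_id: RegionId,
    file_id: FileId,
}

impl RegionFileId {
    fn new(region_id: RegionId, file_id: FileId) -> Self {
        Self { region_id, file_id }
    }

    /// A path provider can derive the SST location from the pair alone.
    fn sst_path(&self, table_dir: &str) -> String {
        format!("{}/{}/{:032x}.parquet", table_dir, self.region_id.0, self.file_id.0)
    }
}

fn main() {
    let id = RegionFileId::new(RegionId(42), FileId(0xdead_beef));
    println!("{}", id.sst_path("data/my_table"));
}
```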
* feat: define PathType
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: adding PathType WIP
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: fix compiler errors
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: add path_type to region_dir_from_table_dir
Move region_dir_from_table_dir to mito and use join_dir internally
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: set path type to ApplierBuilder
Signed-off-by: evenyag <realevenyag@gmail.com>
* style: fmt code
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: fix passing incorrect dir to access layer
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: remove region_dir from CompactionRegion
We can get table_dir and path_type from the access layer
Signed-off-by: evenyag <realevenyag@gmail.com>
* test: fix unit tests
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: fix typo
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: update comment
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: correct marker path
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: use AccessLayer::build_region_dir to get region dir
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: log entries in test
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: set path type in catchup
Signed-off-by: evenyag <realevenyag@gmail.com>
* test: fix test_open_region_failure test
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: fix compiler errors
Signed-off-by: evenyag <realevenyag@gmail.com>
---------
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: evenyag <realevenyag@gmail.com>
Co-authored-by: Yingwen <realevenyag@gmail.com>
* feat/add-sst-file-num-in-region-stat:
### Add SST File Count to Region Statistics
- **Enhancements**:
- Added `sst_num` to track the number of SST files in region statistics across multiple modules.
- Updated `RegionStat` and `RegionStatistic` structs in `datanode.rs` and `region_engine.rs` to include `sst_num`.
- Modified `MitoRegion` and `SstVersion` in `region.rs` and `version.rs` to compute and return the number of SST files.
- Adjusted test cases in `collect_leader_region_handler.rs`, `failure_handler.rs`, `region_lease_handler.rs`, and `weight_compute.rs` to initialize `sst_num`.
- Updated `get_region_statistic` in `utils.rs` to sum `sst_num` from metadata and data statistics.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
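The aggregation itself is small; a sketch with stand-in field names for how `sst_num` could be summed from the metadata and data parts of a region statistic (the real `RegionStatistic` layout differs):

```rust
#[derive(Default)]
struct PartStats {
    sst_num: u64,
}

#[derive(Default)]
struct RegionStatistic {
    metadata: PartStats,
    data: PartStats,
}

impl RegionStatistic {
    /// Total SST file count, summed across metadata and data statistics.
    fn sst_num(&self) -> u64 {
        self.metadata.sst_num + self.data.sst_num
    }
}

fn main() {
    let stat = RegionStatistic {
        metadata: PartStats { sst_num: 1 },
        data: PartStats { sst_num: 7 },
    };
    assert_eq!(stat.sst_num(), 8);
}
```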
* feat/add-sst-file-num-in-region-stat:
Add `sst_num` to `region_statistics`
- Updated `region_statistics.rs` to include a new constant `SST_NUM` and added it to the schema and builder structures.
- Modified `information_schema.result` to reflect the addition of `sst_num` in the `region_statistics` table.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* chore/update-opendal-dashboard:
### Update Grafana Dashboard Queries
- **Enhanced Metrics Queries**: Updated Prometheus queries in `dashboard.json`, `dashboard.md`, and `dashboard.yaml` files for both `cluster` and `standalone` dashboards to include additional operations (`Reader::read`, `Writer::write`, `Writer::close`) in the metrics calculations.
- **Legend Format Adjustments**: Modified legend formats to include the `operation` field for better clarity in visualizations.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* chore/update-opendal-dashboard:
Enhance Legend Format in Grafana Dashboards
- Updated the `legendFormat` in `dashboard.json`, `dashboard.md`, and `dashboard.yaml` files for both `cluster` and `standalone` dashboards to include the `operation` field.
- This change affects the following files:
- `grafana/dashboards/metrics/cluster/dashboard.json`
- `grafana/dashboards/metrics/cluster/dashboard.md`
- `grafana/dashboards/metrics/cluster/dashboard.yaml`
- `grafana/dashboards/metrics/standalone/dashboard.json`
- `grafana/dashboards/metrics/standalone/dashboard.md`
- `grafana/dashboards/metrics/standalone/dashboard.yaml`
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat: supports null response format for http API
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix: license header and assertion
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* chore: in seconds
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
---------
Signed-off-by: Dennis Zhuang <killme2008@gmail.com>
* fix/check-grpc-client-unavailable:
Improve async handling in `greptime_handler.rs`
- Updated the `DoPut` response handling to use `await` with `result_sender.send` for better asynchronous operation.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* fix/check-grpc-client-unavailable:
### Improve Error Handling in `greptime_handler.rs`
- Enhanced error handling for the `DoPut` operation by switching from `send` to `try_send` for the `result_sender`.
- Added specific logging for unreachable clients, including `request_id` in the warning message.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
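A minimal tokio sketch of why `try_send` helps here: it never awaits, so a receiver that went away surfaces immediately as `TrySendError::Closed` and can be logged with the request id instead of blocking the handler. The channel type and message are stand-ins for the actual `DoPut` response path:

```rust
// Cargo.toml (assumed): tokio = { version = "1", features = ["sync", "rt-multi-thread", "macros"] }
use tokio::sync::mpsc::{channel, error::TrySendError};

#[tokio::main]
async fn main() {
    let (result_sender, rx) = channel::<String>(1);
    drop(rx); // simulate a client that went away

    let request_id = 42u64;
    // `try_send` returns immediately instead of awaiting channel capacity.
    match result_sender.try_send("affected_rows=10".to_string()) {
        Ok(()) => {}
        Err(TrySendError::Full(_)) => {
            eprintln!("client is slow, dropping response, request_id: {request_id}");
        }
        Err(TrySendError::Closed(_)) => {
            eprintln!("client unreachable, request_id: {request_id}");
        }
    }
}
```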
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* wip
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* refactor/expose-bulk-symbols:
### Commit Message
Enhance DDL Module Accessibility and Refactor `verify_alter` Function
- **`statement.rs`**: Made the `ddl` module public to enhance accessibility.
- **`ddl.rs`**:
- Made `NAME_PATTERN_REG` public for broader usage.
- Refactored `verify_alter` function to be a standalone public function, improving modularity and reusability.
- Made `parse_partitions` function public to allow external access.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* refactor/expose-bulk-symbols:
### Add Parquet Writer and Enhance Row Modifier
- **Add Parquet Writer Module**: Introduced a new module `parquet_writer.rs` to bridge `opendal` `Writer` with `parquet` `AsyncFileWriter`.
- **Enhance Row Modifier**: Updated `RowModifier` to use `Default` trait and made `fill_internal_columns` a public static method in `row_modifier.rs`.
- **Expose Internal Structures**: Made `RowsIter`, `RowIter`, `TablesBuilder`, and `TableBuilder` structs public in `row_modifier.rs` and `prom_row_builder.rs`.
- **Update Metric Engine**: Changed `RowModifier` instantiation to use `default()` in `engine.rs`.
- **Modify Table Options Handling**: Added `fill_table_options_for_create` function in `insert.rs` to handle table options based on `AutoCreateTableType`.
- **Make Constants Public**: Changed `DEFAULT_ROW_GROUP_SIZE` to public in `parquet.rs`.
- **Expose Functions**: Made `extract_add_columns_expr` public in `expr_helper.rs` and `AutoCreateTableType` public in `insert.rs`.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* refactor/expose-bulk-symbols:
### Commit Message
Enhance HTTP Server and Prometheus Integration
- **`http.rs`**: Made `extractor` module public to allow external access.
- **`prom_store.rs`**: Refactored `decode_remote_write_request` to return `TablesBuilder` and adjusted logic for processing requests based on pipeline usage.
- **`lib.rs`**: Made `metrics` module public for broader accessibility.
- **`prom_row_builder.rs`**: Exposed `tables` field in `TablesBuilder` for external manipulation.
- **`proto.rs`**: Changed visibility of `table_data` in `PromWriteRequest` to `pub(crate)` for internal module access.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* refactor/expose-bulk-symbols:
### Add Accessor Methods for Managers and Executors
- **`src/frontend/src/instance.rs`**: Added accessor methods for `NodeManagerRef`, `PartitionRuleManagerRef`, `CacheInvalidatorRef`, and `ProcedureExecutorRef` to the `Instance` struct.
- **`src/operator/src/insert.rs`**: Introduced methods to access `NodeManagerRef` and `PartitionRuleManagerRef` in the `Inserter` struct.
- **`src/operator/src/statement.rs`**: Added methods to retrieve `ProcedureExecutorRef` and `CacheInvalidatorRef` in the `StatementExecutor` struct.
### Change HashMap Implementation
- **`src/servers/src/prom_row_builder.rs`**: Replaced `ahash::HashMap` with `std::collections::HashMap`.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* refactor/expose-bulk-symbols:
Refactor table option handling in `insert.rs`
- Replaced `Vec` with `HashMap` for `table_options` to improve efficiency.
- Extracted logic for filling table options into a new function `fill_table_options_for_create`.
- Modified `fill_table_options_for_create` to return the engine name based on `create_type`.
- Simplified the insertion of table options into `create_table_expr` by using `extend` method.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* refactor/expose-bulk-symbols:
Refactor `insert.rs` to separate engine name logic from table options
- Updated `Inserter` implementation to determine `engine_name` separately from `fill_table_options_for_create`.
- Modified `fill_table_options_for_create` to no longer return an engine name, focusing solely on populating table options.
- Adjusted logic to set `engine_name` based on `AutoCreateTableType`, using `METRIC_ENGINE_NAME` for logical tables and `default_engine()` otherwise.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
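A sketch of the final shape after these two refactors; the names mirror the commits, while the option key and engine names are illustrative assumptions:

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum AutoCreateTableType {
    Logical,
    Physical,
}

struct CreateTableExpr {
    engine: String,
    table_options: HashMap<String, String>,
}

/// Only populates table options; it no longer decides the engine name.
fn fill_table_options_for_create(
    options: &mut HashMap<String, String>,
    create_type: &AutoCreateTableType,
) {
    if matches!(create_type, AutoCreateTableType::Logical) {
        options.insert("on_physical_table".to_string(), "greptime_physical".to_string());
    }
}

fn main() {
    let create_type = AutoCreateTableType::Logical;
    let mut options = HashMap::new();
    fill_table_options_for_create(&mut options, &create_type);

    // Engine selection happens outside the fill helper now.
    let engine_name = match create_type {
        AutoCreateTableType::Logical => "metric", // e.g. METRIC_ENGINE_NAME
        AutoCreateTableType::Physical => "mito",  // e.g. default_engine()
    };

    let mut expr = CreateTableExpr {
        engine: engine_name.to_string(),
        table_options: HashMap::new(),
    };
    // Simplified insertion of the collected options via `extend`.
    expr.table_options.extend(options);
    println!("engine={}, options={:?}", expr.engine, expr.table_options);
}
```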
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
@@ -55,8 +55,12 @@ GreptimeDB uses the [Apache 2.0 license](https://github.com/GreptimeTeam/greptim
- To ensure that the community is free and confident in its ability to use your contributions, please sign the Contributor License Agreement (CLA) which will be incorporated in the pull request process.
- Make sure all files have proper license header (running `docker run --rm -v $(pwd):/github/workspace ghcr.io/korandoru/hawkeye-native:v3 format` from the project root).
- Make sure all your code is formatted and follows the [coding style](https://pingcap.github.io/style-guide/rust/) and [style guide](docs/style-guide.md).
- Make sure all unit tests pass using [nextest](https://nexte.st/index.html) `cargo nextest run --workspace --features pg_kvbackend,mysql_kvbackend` or `make test`.
- Make sure all clippy warnings are fixed (you can check it locally by running `cargo clippy --workspace --all-targets -- -D warnings` or `make clippy`).
- Ensure there are no unused dependencies by running `make check-udeps` (clean them up with `make fix-udeps` if reported).
- If you must keep a target-specific dependency (e.g. under `[target.'cfg(...)'.dev-dependencies]`), add a cargo-udeps ignore entry in the same `Cargo.toml`, for example:
`[package.metadata.cargo-udeps.ignore]` with `development = ["rexpect"]` (or `dependencies`/`build` as appropriate).
- When modifying sample configuration files in `config/`, run `make config-docs` (which requires Docker to be installed) to update the configuration documentation and include it in your commit.
**GreptimeDB** is an open-source, cloud-native database that unifies metrics, logs, and traces, enabling real-time observability at any scale — across edge, cloud, and hybrid environments.
## Features
| Feature | Description |
| --------- | ----------- |
| [All-in-One Observability](https://docs.greptime.com/user-guide/concepts/why-greptimedb) | OpenTelemetry-native platform unifying metrics, logs, and traces. Query via [SQL](https://docs.greptime.com/user-guide/query-data/sql), [PromQL](https://docs.greptime.com/user-guide/query-data/promql), and [Flow](https://docs.greptime.com/user-guide/flow-computation/overview). |
| [High Performance](https://docs.greptime.com/user-guide/manage-data/data-index) | Written in Rust with [rich indexing](https://docs.greptime.com/user-guide/manage-data/data-index) (inverted, fulltext, skipping, vector), delivering sub-second responses at PB scale. |
| [Cost Efficiency](https://docs.greptime.com/user-guide/concepts/architecture) | 50x lower operational and storage costs with compute-storage separation and native object storage (S3, Azure Blob, etc.). |
| [Cloud-Native & Scalable](https://docs.greptime.com/user-guide/deployments-administration/deploy-on-kubernetes/greptimedb-operator-management) | Purpose-built for [Kubernetes](https://docs.greptime.com/user-guide/deployments-administration/deploy-on-kubernetes/greptimedb-operator-management) with unlimited cross-cloud scaling, handling hundreds of thousands of concurrent requests. |
| [Developer-Friendly](https://docs.greptime.com/user-guide/protocols/overview) | SQL/PromQL interfaces, built-in web dashboard, REST API, MySQL/PostgreSQL protocol compatibility, and native [OpenTelemetry](https://docs.greptime.com/user-guide/ingest-data/for-observability/opentelemetry/) support. |
| [Flexible Deployment](https://docs.greptime.com/user-guide/deployments-administration/overview) | Deploy anywhere from ARM-based edge devices (including [Android](https://docs.greptime.com/user-guide/deployments-administration/run-on-android)) to cloud, with unified APIs and efficient data sync. |
✅ **Perfect for:**
- Unified observability stack replacing Prometheus + Loki + Tempo
- Large-scale metrics with high cardinality (millions to billions of time series)
- Large-scale observability platform requiring cost efficiency and scalability
- IoT and edge computing with resource and bandwidth constraints
Learn more in [Why GreptimeDB](https://docs.greptime.com/user-guide/concepts/why-greptimedb) and [Observability 2.0 and the Database for It](https://greptime.com/blogs/2025-04-25-greptimedb-observability2-new-database).
@@ -86,10 +92,10 @@ Learn more in [Why GreptimeDB](https://docs.greptime.com/user-guide/concepts/why
GreptimeDB can run in two modes:
* **Standalone Mode** - Single binary for development and small deployments
* **Distributed Mode** - Separate components for production scale:
- Frontend: Query processing and protocol handling
- Datanode: Data storage and retrieval
- Metasrv: Metadata management and coordination
Read the [architecture](https://docs.greptime.com/contributor-guide/overview/#architecture) document. [DeepWiki](https://deepwiki.com/GreptimeTeam/greptimedb/1-overview) provides an in-depth look at GreptimeDB:
<img alt="GreptimeDB System Overview" src="docs/architecture.png">
- **Grafana Data Source**: [GreptimeDB Grafana data source plugin](https://github.com/GreptimeTeam/greptimedb-grafana-datasource)
- **Grafana Dashboard**: [Official Dashboard for monitoring](https://github.com/GreptimeTeam/greptimedb/blob/main/grafana/README.md)
## Project Status
> **Status:** Beta — marching toward v1.0 GA!
> **GA (v1.0):** January 10, 2026
- Deployed in production by open-source projects and commercial users
- Stable, actively maintained, with regular releases ([version info](https://docs.greptime.com/nightly/reference/about-greptimedb-version))
- Suitable for evaluation and pilot deployments
GreptimeDB v1.0 represents a major milestone toward maturity — marking stable APIs, production readiness, and proven performance.
**Roadmap:** Beta1 (Nov 10) → Beta2 (Nov 24) → RC1 (Dec 8) → GA (Jan 10, 2026), please read [v1.0 highlights and release plan](https://greptime.com/blogs/2025-11-05-greptimedb-v1-highlights) for details.
For production use, we recommend using the latest stable release.
[Star History](https://www.star-history.com/#GreptimeTeam/GreptimeDB&Date)
@@ -214,5 +222,5 @@ Special thanks to all contributors! See [AUTHORS.md](https://github.com/Greptime
| `default_timezone` | String | Unset | The default timezone of the server. |
| `default_column_prefix` | String | Unset | The default column prefix for auto-created time index and value columns. |
| `init_regions_in_background` | Bool | `false` | Initialize all regions in the background during the startup.<br/>By default, it provides services after all regions have been initialized. |
| `max_concurrent_queries` | Integer | `0` | The maximum number of concurrent queries allowed to be executed. Zero means unlimited.<br/>NOTE: This setting affects scan_memory_limit's privileged tier allocation.<br/>When set, 70% of queries get privileged memory access (full scan_memory_limit).<br/>The remaining 30% get standard tier access (70% of scan_memory_limit). |
| `enable_telemetry` | Bool | `true` | Enable telemetry to collect anonymous usage data. Enabled by default. |
| `max_in_flight_write_bytes` | String | Unset | The maximum in-flight write bytes. |
| `runtime` | -- | -- | The runtime options. |
@@ -25,12 +26,15 @@
| `http.addr` | String | `127.0.0.1:4000` | The address to bind the HTTP server. |
| `http.timeout` | String | `0s` | HTTP request timeout. Set to 0 to disable timeout. |
| `http.body_limit` | String | `64MB` | HTTP request body limit.<br/>The following units are supported: `B`, `KB`, `KiB`, `MB`, `MiB`, `GB`, `GiB`, `TB`, `TiB`, `PB`, `PiB`.<br/>Set to 0 to disable limit. |
| `http.max_total_body_memory` | String | Unset | Maximum total memory for all concurrent HTTP request bodies.<br/>Set to 0 to disable the limit. Default: "0" (unlimited) |
| `http.enable_cors` | Bool | `true` | HTTP CORS support, enabled by default.<br/>This allows browsers to access HTTP APIs without CORS restrictions. |
| `http.prom_validation_mode` | String | `strict` | Whether to enable validation for Prometheus remote write requests.<br/>Available options:<br/>- strict: deny invalid UTF-8 strings (default).<br/>- lossy: allow invalid UTF-8 strings, replace invalid characters with REPLACEMENT_CHARACTER(U+FFFD).<br/>- unchecked: do not validate strings. |
| `grpc` | -- | -- | The gRPC server options. |
| `grpc.bind_addr` | String | `127.0.0.1:4001` | The address to bind the gRPC server. |
| `grpc.runtime_size` | Integer | `8` | The number of server worker threads. |
| `grpc.max_total_message_memory` | String | Unset | Maximum total memory for all concurrent gRPC request messages.<br/>Set to 0 to disable the limit. Default: "0" (unlimited) |
| `grpc.max_connection_age` | String | Unset | The maximum connection age for gRPC connection.<br/>The value can be a human-readable time string. For example: `10m` for ten minutes or `1h` for one hour.<br/>Refer to https://grpc.io/docs/guides/keepalive/ for more details. |
| `grpc.tls` | -- | -- | gRPC server TLS options, see `mysql.tls` section. |
| `flow.num_workers` | Integer | `0` | The number of flow workers in flownode.<br/>If not set (or set to 0), the number of CPU cores divided by 2 is used. |
| `query` | -- | -- | The query engine options. |
| `query.parallelism` | Integer | `0` | Parallelism of the query engine.<br/>Defaults to 0, which means the number of CPU cores. |
| `query.memory_pool_size` | String | `50%` | Memory pool size for query execution operators (aggregation, sorting, join).<br/>Supports absolute size (e.g., "2GB", "4GB") or percentage of system memory (e.g., "20%").<br/>Setting it to 0 disables the limit (unbounded, default behavior).<br/>When this limit is reached, queries will fail with ResourceExhausted error.<br/>NOTE: This does NOT limit memory used by table scans. |
| `storage` | -- | -- | The data storage options. |
| `storage.data_home` | String | `./greptimedb_data` | The working home directory. |
| `storage.type` | String | `File` | The storage type used to store the data.<br/>- `File`: the data is stored in the local file system.<br/>- `S3`: the data is stored in the S3 object storage.<br/>- `Gcs`: the data is stored in the Google Cloud Storage.<br/>- `Azblob`: the data is stored in the Azure Blob Storage.<br/>- `Oss`: the data is stored in the Aliyun OSS. |
| `storage.enable_read_cache` | Bool | `true` | Whether to enable read cache. If not set, the read cache will be enabled by default when using object storage. |
| `storage.cache_path` | String | Unset | Read cache configuration for object storage such as 'S3'. It's configured by default when using object storage, and configuring it is recommended for better performance.<br/>A local file directory, defaults to `{data_home}`. An empty string means disabling. |
| `storage.cache_capacity` | String | Unset | The local file cache capacity in bytes. If your disk space is sufficient, it is recommended to set it larger. |
| `storage.bucket` | String | Unset | The S3 bucket name.<br/>**It's only used when the storage type is `S3`, `Oss` and `Gcs`**. |
@@ -145,10 +152,15 @@
| `region_engine.mito.write_cache_path` | String | `""` | File system path for write cache, defaults to `{data_home}`. |
| `region_engine.mito.write_cache_size` | String | `5GiB` | Capacity for write cache. If your disk space is sufficient, it is recommended to set it larger. |
| `region_engine.mito.preload_index_cache` | Bool | `true` | Preload index (puffin) files into cache on region open (default: true).<br/>When enabled, index files are loaded into the write cache during region initialization,<br/>which can improve query performance at the cost of longer startup times. |
| `region_engine.mito.index_cache_percent` | Integer | `20` | Percentage of write cache capacity allocated for index (puffin) files (default: 20).<br/>The remaining capacity is used for data (parquet) files.<br/>Must be between 0 and 100 (exclusive). For example, with a 5GiB write cache and 20% allocation,<br/>1GiB is reserved for index files and 4GiB for data files. |
| `region_engine.mito.parallel_scan_channel_size` | Integer | `32` | Capacity of the channel to send data from parallel scan tasks to the main task. |
| `region_engine.mito.max_concurrent_scan_files` | Integer | `384` | Maximum number of SST files to scan concurrently. |
| `region_engine.mito.allow_stale_entries` | Bool | `false` | Whether to allow stale WAL entries read during replay. |
| `region_engine.mito.scan_memory_limit` | String | `50%` | Memory limit for table scans across all queries.<br/>Supports absolute size (e.g., "2GB") or percentage of system memory (e.g., "20%").<br/>Setting it to 0 disables the limit.<br/>NOTE: Works with max_concurrent_queries for tiered memory allocation.<br/>- If max_concurrent_queries is set: 70% of queries get full access, 30% get 70% access.<br/>- If max_concurrent_queries is 0 (unlimited): first 20 queries get full access, rest get 70% access.<br/>See the sketch after this table for the tiering arithmetic. |
| `region_engine.mito.min_compaction_interval` | String | `0m` | Minimum time interval between two compactions.<br/>To align with the old behavior, the default value is 0 (no restrictions). |
| `region_engine.mito.default_experimental_flat_format` | Bool | `false` | Whether to enable experimental flat format as the default format. |
| `region_engine.mito.index` | -- | -- | The options for index in Mito engine. |
| `region_engine.mito.index.aux_path` | String | `""` | Auxiliary directory path for the index in filesystem, used to store intermediate files for<br/>creating the index and staging files for searching the index, defaults to `{data_home}/index_intermediate`.<br/>The default name for this directory is `index_intermediate` for backward compatibility.<br/><br/>This path contains two subdirectories:<br/>- `__intm`: for storing intermediate files used during creating index.<br/>- `staging`: for storing staging files used during searching index. |
| `region_engine.mito.index.staging_size` | String | `2GB` | The max capacity of the staging directory. |
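As referenced from the `scan_memory_limit` row above, a minimal Rust sketch of the documented tiering arithmetic (purely the rule as stated, not the mito2 implementation):

```rust
/// Combines `scan_memory_limit` (limit) with `max_concurrent_queries` (n):
/// when n > 0, the first 70% of query slots are privileged (full limit) and
/// the rest get 70% of the limit; when n == 0, the first 20 queries are
/// privileged.
fn scan_limit_for_query(query_index: usize, scan_memory_limit: u64, max_concurrent_queries: usize) -> u64 {
    let privileged_slots = if max_concurrent_queries > 0 {
        (max_concurrent_queries * 70) / 100
    } else {
        20
    };
    if query_index < privileged_slots {
        scan_memory_limit
    } else {
        (scan_memory_limit * 70) / 100
    }
}

fn main() {
    let limit = 1 << 30; // 1 GiB scan memory limit
    assert_eq!(scan_limit_for_query(0, limit, 10), limit); // privileged tier
    assert_eq!(scan_limit_for_query(9, limit, 10), (limit * 70) / 100); // standard tier
}
```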
@@ -180,33 +192,28 @@
| `region_engine.mito.memtable.fork_dictionary_bytes` | String | `1GiB` | Max dictionary bytes.<br/>Only available for `partition_tree` memtable. |
| `logging.append_stdout` | Bool | `true` | Whether to append logs to stdout. |
| `logging.log_format` | String | `text` | The log format. Can be `text`/`json`. |
| `logging.max_log_files` | Integer | `720` | The maximum number of log files. |
| `logging.otlp_export_protocol` | String | `http` | The OTLP tracing export protocol. Can be `grpc`/`http`. |
| `logging.otlp_headers` | -- | -- | Additional OTLP headers, only valid when using OTLP http |
| `logging.tracing_sample_ratio` | -- | Unset | The percentage of tracing that will be sampled and exported.<br/>Valid range `[0, 1]`: 1 means all traces are sampled, 0 means no traces are sampled; the default value is 1.<br/>Ratios > 1 are treated as 1; fractions < 0 are treated as 0. |
| `export_metrics` | -- | -- | The standalone can export its metrics and send to Prometheus compatible service (e.g. `greptimedb`) from remote-write API.<br/>This is only used for `greptimedb` to export its own metrics internally. It's different from prometheus scrape. |
| `export_metrics.write_interval` | String | `30s` | The interval of export metrics. |
| `export_metrics.self_import` | -- | -- | For `standalone` mode, `self_import` is recommended to collect metrics generated by itself<br/>You must create the database before enabling it. |
| `export_metrics.remote_write.url` | String | `""` | The prometheus remote write endpoint that the metrics send to. The url example can be: `http://127.0.0.1:4000/v1/prometheus/write?db=greptime_metrics`. |
| `memory.enable_heap_profiling` | Bool | `true` | Whether to enable heap profiling activation during startup.<br/>When enabled, heap profiling will be activated if the `MALLOC_CONF` environment variable<br/>is set to "prof:true,prof_active:false". The official image adds this env variable.<br/>Default is true. |
## Distributed Mode
@@ -216,6 +223,7 @@
| Key | Type | Default | Descriptions |
| --- | -----| ------- | ----------- |
| `default_timezone` | String | Unset | The default timezone of the server. |
| `default_column_prefix` | String | Unset | The default column prefix for auto-created time index and value columns. |
| `max_in_flight_write_bytes` | String | Unset | The maximum in-flight write bytes. |
| `runtime` | -- | -- | The runtime options. |
| `runtime.global_rt_size` | Integer | `8` | The number of threads to execute the runtime for global read operations. |
@@ -227,6 +235,7 @@
| `http.addr` | String | `127.0.0.1:4000` | The address to bind the HTTP server. |
| `http.timeout` | String | `0s` | HTTP request timeout. Set to 0 to disable timeout. |
| `http.body_limit` | String | `64MB` | HTTP request body limit.<br/>The following units are supported: `B`, `KB`, `KiB`, `MB`, `MiB`, `GB`, `GiB`, `TB`, `TiB`, `PB`, `PiB`.<br/>Set to 0 to disable limit. |
| `http.max_total_body_memory` | String | Unset | Maximum total memory for all concurrent HTTP request bodies.<br/>Set to 0 to disable the limit. Default: "0" (unlimited) |
| `http.enable_cors` | Bool | `true` | HTTP CORS support, enabled by default.<br/>This allows browsers to access HTTP APIs without CORS restrictions. |
| `http.prom_validation_mode` | String | `strict` | Whether to enable validation for Prometheus remote write requests.<br/>Available options:<br/>- strict: deny invalid UTF-8 strings (default).<br/>- lossy: allow invalid UTF-8 strings, replace invalid characters with REPLACEMENT_CHARACTER(U+FFFD).<br/>- unchecked: do not validate strings. |
@@ -234,17 +243,30 @@
| `grpc.bind_addr` | String | `127.0.0.1:4001` | The address to bind the gRPC server. |
| `grpc.server_addr` | String | `127.0.0.1:4001` | The address advertised to the metasrv, and used for connections from outside the host.<br/>If left empty or unset, the server will automatically use the IP address of the first network interface<br/>on the host, with the same port number as the one specified in `grpc.bind_addr`. |
| `grpc.runtime_size` | Integer | `8` | The number of server worker threads. |
| `grpc.max_total_message_memory` | String | Unset | Maximum total memory for all concurrent gRPC request messages.<br/>Set to 0 to disable the limit. Default: "0" (unlimited) |
| `grpc.flight_compression` | String | `arrow_ipc` | Compression mode for frontend side Arrow IPC service. Available options:<br/>- `none`: disable all compression<br/>- `transport`: only enable gRPC transport compression (zstd)<br/>- `arrow_ipc`: only enable Arrow IPC compression (lz4)<br/>- `all`: enable all compression.<br/>Default to `none` |
| `grpc.max_connection_age` | String | Unset | The maximum connection age for gRPC connection.<br/>The value can be a human-readable time string. For example: `10m` for ten minutes or `1h` for one hour.<br/>Refer to https://grpc.io/docs/guides/keepalive/ for more details. |
| `grpc.tls` | -- | -- | gRPC server TLS options, see `mysql.tls` section. |
| `grpc.tls.watch` | Bool | `false` | Watch for Certificate and key file change and auto reload.<br/>For now, gRPC tls config does not support auto reload. |
| `internal_grpc` | -- | -- | The internal gRPC server options. Internal gRPC port for nodes inside cluster to access frontend. |
| `internal_grpc.bind_addr` | String | `127.0.0.1:4010` | The address to bind the gRPC server. |
| `internal_grpc.server_addr` | String | `127.0.0.1:4010` | The address advertised to the metasrv, and used for connections from outside the host.<br/>If left empty or unset, the server will automatically use the IP address of the first network interface<br/>on the host, with the same port number as the one specified in `grpc.bind_addr`. |
| `internal_grpc.runtime_size` | Integer | `8` | The number of server worker threads. |
| `internal_grpc.flight_compression` | String | `arrow_ipc` | Compression mode for frontend side Arrow IPC service. Available options:<br/>- `none`: disable all compression<br/>- `transport`: only enable gRPC transport compression (zstd)<br/>- `arrow_ipc`: only enable Arrow IPC compression (lz4)<br/>- `all`: enable all compression.<br/>Default to `none` |
| `internal_grpc.tls` | -- | -- | internal gRPC server TLS options, see `mysql.tls` section. |
| `internal_grpc.tls.watch` | Bool | `false` | Watch for Certificate and key file change and auto reload.<br/>For now, gRPC tls config does not support auto reload. |
| `query.parallelism` | Integer | `0` | Parallelism of the query engine.<br/>Defaults to 0, which means the number of CPU cores. |
| `query.allow_query_fallback` | Bool | `false` | Whether to allow query fallback when push-down optimization fails.<br/>Defaults to false, meaning an error message is returned when push-down optimization fails. |
| `query.memory_pool_size` | String | `50%` | Memory pool size for query execution operators (aggregation, sorting, join).<br/>Supports absolute size (e.g., "4GB", "8GB") or percentage of system memory (e.g., "30%").<br/>Setting it to 0 disables the limit (unbounded, default behavior).<br/>When this limit is reached, queries will fail with ResourceExhausted error.<br/>NOTE: This does NOT limit memory used by table scans (only applies to datanodes). |
| `logging.append_stdout` | Bool | `true` | Whether to append logs to stdout. |
| `logging.log_format` | String | `text` | The log format. Can be `text`/`json`. |
| `logging.max_log_files` | Integer | `720` | The maximum number of log files. |
| `logging.otlp_export_protocol` | String | `http` | The OTLP tracing export protocol. Can be `grpc`/`http`. |
| `logging.otlp_headers` | -- | -- | Additional OTLP headers, only valid when using OTLP http |
| `logging.tracing_sample_ratio` | -- | Unset | The percentage of tracing that will be sampled and exported.<br/>Valid range `[0, 1]`: 1 means all traces are sampled, 0 means no traces are sampled; the default value is 1.<br/>Ratios > 1 are treated as 1; fractions < 0 are treated as 0. |
| `slow_query.record_type` | String | `system_table` | The record type of slow queries. It can be `system_table` or `log`.<br/>If `system_table` is selected, the slow queries will be recorded in a system table `greptime_private.slow_queries`.<br/>If `log` is selected, the slow queries will be logged in a log file `greptimedb-slow-queries.*`. |
| `slow_query.threshold` | String | `30s` | The threshold of slow query. It can be human readable time string, for example: `10s`, `100ms`, `1s`. |
| `slow_query.sample_ratio` | Float | `1.0` | The sampling ratio of slow query log. The value should be in the range of (0, 1]. For example, `0.1` means 10% of the slow queries will be logged and `1.0` means all slow queries will be logged. |
| `export_metrics` | -- | -- | The frontend can export its metrics and send to Prometheus compatible service (e.g. `greptimedb` itself) from remote-write API.<br/>This is only used for `greptimedb` to export its own metrics internally. It's different from prometheus scrape. |
| `export_metrics.write_interval` | String | `30s` | The interval of export metrics. |
| `export_metrics.remote_write` | -- | -- | -- |
| `export_metrics.remote_write.url` | String | `""` | The prometheus remote write endpoint that the metrics send to. The url example can be: `http://127.0.0.1:4000/v1/prometheus/write?db=greptime_metrics`. |
| `slow_query.ttl` | String | `90d` | The TTL of the `slow_queries` system table. Default is `90d` when `record_type` is `system_table`. |
| `tracing` | -- | -- | The tracing options. Only effect when compiled with `tokio-console` feature. |
| `tracing.tokio_console_addr` | String | Unset | The tokio console address. |
| `memory` | -- | -- | The memory options. |
| `memory.enable_heap_profiling` | Bool | `true` | Whether to enable heap profiling activation during startup.<br/>When enabled, heap profiling will be activated if the `MALLOC_CONF` environment variable<br/>is set to "prof:true,prof_active:false". The official image adds this env variable.<br/>Default is true. |
| `event_recorder` | -- | -- | Configuration options for the event recorder. |
| `event_recorder.ttl` | String | `90d` | TTL for the events table that will be used to store the events. Default is `90d`. |
### Metasrv
@@ -317,10 +340,11 @@
| Key | Type | Default | Descriptions |
| --- | -----| ------- | ----------- |
| `data_home` | String | `./greptimedb_data` | The working home directory. |
| `store_addrs` | Array | -- | Store server address(es). The format depends on the selected backend.<br/><br/>For etcd: a list of "host:port" endpoints.<br/>e.g. ["192.168.1.1:2379", "192.168.1.2:2379"]<br/><br/>For PostgreSQL: a connection string in libpq format or URI.<br/>e.g.<br/>- "host=localhost port=5432 user=postgres password=<PASSWORD> dbname=postgres"<br/>- "postgresql://user:password@localhost:5432/mydb?connect_timeout=10"<br/>For details, see: https://docs.rs/tokio-postgres/latest/tokio_postgres/config/struct.Config.html<br/><br/>For MySQL: a connection URL.<br/>e.g. "mysql://user:password@localhost:3306/greptime_meta?ssl-mode=VERIFY_CA&ssl-ca=/path/to/ca.pem" |
| `store_key_prefix` | String | `""` | If it's not empty, the metasrv will store all data with this key prefix. |
| `backend` | String | `etcd_store` | The datastore for meta server.<br/>Available values:<br/>- `etcd_store` (default value)<br/>- `memory_store`<br/>- `postgres_store`<br/>- `mysql_store` |
| `meta_table_name` | String | `greptime_metakv` | Table name in RDS to store metadata. Effective when using an RDS kvbackend.<br/>**Only used when backend is `postgres_store`.** |
| `meta_schema_name` | String | `greptime_schema` | Optional PostgreSQL schema for metadata table and election table name qualification.<br/>When PostgreSQL public schema is not writable (e.g., PostgreSQL 15+ with restricted public),<br/>set this to a writable schema. GreptimeDB will use `meta_schema_name`.`meta_table_name`.<br/>GreptimeDB will NOT create the schema automatically; please ensure it exists or the user has permission.<br/>**Only used when backend is `postgres_store`.** |
| `meta_election_lock_id` | Integer | `1` | Advisory lock id in PostgreSQL for election. Effective when using PostgreSQL as kvbackend.<br/>Only used when backend is `postgres_store`. |
| `use_memory_store` | Bool | `false` | Store data in memory. |
@@ -332,6 +356,11 @@
| `runtime` | -- | -- | The runtime options. |
| `runtime.global_rt_size` | Integer | `8` | The number of threads to execute the runtime for global read operations. |
| `runtime.compact_rt_size` | Integer | `4` | The number of threads to execute the runtime for global write operations. |
| `backend_tls` | -- | -- | TLS configuration for kv store backend (applicable for etcd, PostgreSQL, and MySQL backends)<br/>When using etcd, PostgreSQL, or MySQL as metadata store, you can configure TLS here<br/><br/>Note: if TLS is configured in both this section and the `store_addrs` connection string, the<br/>settings here will override the TLS settings in `store_addrs`. |
| `backend_tls.mode` | String | `prefer` | TLS mode, refer to https://www.postgresql.org/docs/current/libpq-ssl.html<br/>- "disable" - No TLS<br/>- "prefer" (default) - Try TLS, fallback to plain<br/>- "require" - Require TLS<br/>- "verify_ca" - Require TLS and verify CA<br/>- "verify_full" - Require TLS and verify hostname |
| `backend_tls.ca_cert_path` | String | `""` | Path to CA certificate file (for server certificate verification)<br/>Required when using custom CAs or self-signed certificates<br/>Leave empty to use system root certificates only<br/>Like "/path/to/ca.crt" |
| `grpc` | -- | -- | The gRPC server options. |
| `grpc.bind_addr` | String | `127.0.0.1:3002` | The address to bind the gRPC server. |
| `grpc.server_addr` | String | `127.0.0.1:3002` | The communication server address for the frontend and datanode to connect to metasrv.<br/>If left empty or unset, the server will automatically use the IP address of the first network interface<br/>on the host, with the same port number as the one specified in `bind_addr`. |
@@ -348,10 +377,9 @@
| `procedure.max_metadata_value_size` | String | `1500KiB` | Automatically split large values.<br/>GreptimeDB procedure uses etcd as the default metadata storage backend.<br/>The maximum size of any etcd request is 1.5 MiB:<br/>1500KiB = 1536KiB (1.5MiB) - 36KiB (reserved size of key).<br/>Comment out `max_metadata_value_size` to not split large values (no limit). |
| `procedure.max_running_procedures` | Integer | `128` | Max running procedures.<br/>The maximum number of procedures that can be running at the same time.<br/>If the number of running procedures exceeds this limit, the procedure will be rejected. |
| `failure_detector` | -- | -- | -- |
| `failure_detector.threshold` | Float | `8.0` | Maximum acceptable φ before the peer is treated as failed.<br/>Lower values react faster but yield more false positives. |
| `failure_detector.min_std_deviation` | String | `100ms` | The minimum standard deviation of the heartbeat intervals.<br/>So tiny variations don’t make φ explode. Prevents hypersensitivity when heartbeat intervals barely vary. |
| `failure_detector.acceptable_heartbeat_pause` | String | `10000ms` | The acceptable pause duration between heartbeats.<br/>Additional extra grace period to the learned mean interval before φ rises, absorbing temporary network hiccups or GC pauses. |
| `wal.broker_endpoints` | Array | -- | The broker endpoints of the Kafka cluster.<br/><br/>**It's only used when the provider is `kafka`**. |
| `wal.auto_create_topics` | Bool | `true` | Automatically create topics for WAL.<br/>Set to `true` to automatically create topics for WAL.<br/>Otherwise, use topics named `topic_name_prefix_[0..num_topics)`<br/>**It's only used when the provider is `kafka`**. |
| `wal.auto_prune_interval` | String | `30m` | Interval of automatic WAL pruning.<br/>Set to `0s` to disable automatic WAL pruning, which periodically deletes unused remote WAL entries.<br/>**It's only used when the provider is `kafka`**. |
| `wal.flush_trigger_size` | String | `512MB` | Estimated size threshold to trigger a flush when using Kafka remote WAL.<br/>Since multiple regions may share a Kafka topic, the estimated size is calculated as:<br/> (latest_entry_id - flushed_entry_id) * avg_record_size<br/>MetaSrv triggers a flush for a region when this estimated size exceeds `flush_trigger_size`.<br/>- `latest_entry_id`: The latest entry ID in the topic.<br/>- `flushed_entry_id`: The last flushed entry ID for the region.<br/>Set to "0" to let the system decide the flush trigger size.<br/>**It's only used when the provider is `kafka`**. |
| `wal.checkpoint_trigger_size` | String | `128MB` | Estimated size threshold to trigger a checkpoint when using Kafka remote WAL.<br/>The estimated size is calculated as:<br/> (latest_entry_id - last_checkpoint_entry_id) * avg_record_size<br/>MetaSrv triggers a checkpoint for a region when this estimated size exceeds `checkpoint_trigger_size`.<br/>Set to "0" to let the system decide the checkpoint trigger size.<br/>**It's only used when the provider is `kafka`**. |
| `wal.auto_prune_parallelism` | Integer | `10` | Concurrent task limit for automatic WAL pruning.<br/>**It's only used when the provider is `kafka`**. |
| `wal.num_topics` | Integer | `64` | Number of topics used for remote WAL.<br/>**It's only used when the provider is `kafka`**. |
| `wal.selector_type` | String | `round_robin` | Topic selector type.<br/>Available selector types:<br/>- `round_robin` (default)<br/>**It's only used when the provider is `kafka`**. |
| `wal.topic_name_prefix` | String | `greptimedb_wal_topic` | A Kafka topic is constructed by concatenating `topic_name_prefix` and `topic_id`.<br/>Only accepts strings that match the following regular expression pattern:<br/>[a-zA-Z_:-][a-zA-Z0-9_:\-\.@#]*<br/>e.g., greptimedb_wal_topic_0, greptimedb_wal_topic_1.<br/>**It's only used when the provider is `kafka`**. |
| `wal.replication_factor` | Integer | `1` | Expected number of replicas of each partition.<br/>**It's only used when the provider is `kafka`**. |
| `wal.create_topic_timeout` | String | `30s` | The timeout for creating a Kafka topic.<br/>**It's only used when the provider is `kafka`**. |
| `event_recorder` | -- | -- | Configuration options for the event recorder. |
| `event_recorder.ttl` | String | `90d` | TTL for the events table that will be used to store the events. Default is `90d`. |
| `stats_persistence` | -- | -- | Configuration options for the stats persistence. |
| `stats_persistence.ttl` | String | `0s` | TTL for the stats table that will be used to store the stats.<br/>Set to `0s` to disable stats persistence.<br/>Default is `0s`.<br/>If you want to enable stats persistence, set the TTL to a value greater than 0.<br/>It is recommended to set a small value, e.g., `3h`. |
| `stats_persistence.interval` | String | `10m` | The interval to persist the stats. Default is `10m`.<br/>The minimum value is `10m`, if the value is less than `10m`, it will be overridden to `10m`. |
| `logging` | -- | -- | The logging options. |
| `logging.dir` | String | `./greptimedb_data/logs` | The directory to store the log files. If set to empty, logs will not be written to files. |
| `logging.level` | String | Unset | The log level. Can be `info`/`debug`/`warn`/`error`. |
| `logging.append_stdout` | Bool | `true` | Whether to append logs to stdout. |
| `logging.log_format` | String | `text` | The log format. Can be `text`/`json`. |
| `logging.max_log_files` | Integer | `720` | The maximum number of log files. |
| `logging.otlp_export_protocol` | String | `http` | The OTLP tracing export protocol. Can be `grpc`/`http`. |
| `logging.otlp_headers` | -- | -- | Additional OTLP headers, only valid when using OTLP http |
| `logging.tracing_sample_ratio` | -- | Unset | The percentage of tracing that will be sampled and exported.<br/>Valid range `[0, 1]`: 1 means all traces are sampled, 0 means no traces are sampled; the default value is 1.<br/>Ratios > 1 are treated as 1; fractions < 0 are treated as 0. |
| `export_metrics` | -- | -- | The metasrv can export its metrics and send to Prometheus compatible service (e.g. `greptimedb` itself) from remote-write API.<br/>This is only used for `greptimedb` to export its own metrics internally. It's different from prometheus scrape. |
| `export_metrics.write_interval` | String | `30s` | The interval of export metrics. |
| `export_metrics.remote_write` | -- | -- | -- |
| `export_metrics.remote_write.url` | String | `""` | The prometheus remote write endpoint that the metrics send to. The url example can be: `http://127.0.0.1:4000/v1/prometheus/write?db=greptime_metrics`. |
| `memory.enable_heap_profiling` | Bool | `true` | Whether to enable heap profiling activation during startup.<br/>When enabled, heap profiling will be activated if the `MALLOC_CONF` environment variable<br/>is set to "prof:true,prof_active:false". The official image adds this env variable.<br/>Default is true. |
### Datanode
@@ -395,10 +426,11 @@
| Key | Type | Default | Descriptions |
| --- | -----| ------- | ----------- |
| `node_id` | Integer | Unset | The datanode identifier and should be unique in the cluster. |
| `default_column_prefix` | String | Unset | The default column prefix for auto-created time index and value columns. |
| `require_lease_before_startup` | Bool | `false` | Start services after regions have obtained leases.<br/>It will block the datanode start if it can't receive leases in the heartbeat from metasrv. |
| `init_regions_in_background` | Bool | `false` | Initialize all regions in the background during the startup.<br/>By default, it provides services after all regions have been initialized. |
| `max_concurrent_queries` | Integer | `0` | The maximum number of concurrent queries allowed to be executed. Zero means unlimited.<br/>NOTE: This setting affects scan_memory_limit's privileged tier allocation.<br/>When set, 70% of queries get privileged memory access (full scan_memory_limit).<br/>The remaining 30% get standard tier access (70% of scan_memory_limit). |
| `enable_telemetry` | Bool | `true` | Enable telemetry to collect anonymous usage data. Enabled by default. |
| `http` | -- | -- | The HTTP server options. |
| `http.addr` | String | `127.0.0.1:4000` | The address to bind the HTTP server. |
| `meta_client.metadata_cache_ttl` | String | `10m` | TTL of the metadata cache. |
| `wal.provider` | String | `raft_engine` | The provider of the WAL.<br/>- `raft_engine`: the WAL is stored in the local file system by raft-engine.<br/>- `kafka`: remote WAL whose data is stored in Kafka.<br/>- `noop`: a no-op WAL provider that does not store any WAL data.<br/>**Note: any unflushed data will be lost when the datanode shuts down.** |
| `wal.dir` | String | Unset | The directory to store the WAL files.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.file_size` | String | `128MB` | The size of the WAL segment file.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.purge_threshold` | String | `1GB` | The threshold of the WAL size to trigger a purge.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.overwrite_entry_start_id` | Bool | `false` | Ignore missing entries during read WAL.<br/>**It's only used when the provider is `kafka`**.<br/><br/>This option ensures that when Kafka messages are deleted, the system<br/>can still successfully replay memtable data without throwing an<br/>out-of-range error.<br/>However, enabling this option might lead to unexpected data loss,<br/>as the system will skip over missing entries instead of treating<br/>them as critical errors. |
| `query` | -- | -- | The query engine options. |
| `query.parallelism` | Integer | `0` | Parallelism of the query engine.<br/>Default to 0, which means the number of CPU cores. |
| `query.memory_pool_size` | String | `50%` | Memory pool size for query execution operators (aggregation, sorting, join).<br/>Supports absolute size (e.g., "2GB", "4GB") or percentage of system memory (e.g., "20%").<br/>Setting it to 0 disables the limit (unbounded, default behavior).<br/>When this limit is reached, queries will fail with ResourceExhausted error.<br/>NOTE: This does NOT limit memory used by table scans. |
| `storage` | -- | -- | The data storage options. |
| `storage.data_home` | String | `./greptimedb_data` | The working home directory. |
| `storage.type` | String | `File` | The storage type used to store the data.<br/>- `File`: the data is stored in the local file system.<br/>- `S3`: the data is stored in the S3 object storage.<br/>- `Gcs`: the data is stored in the Google Cloud Storage.<br/>- `Azblob`: the data is stored in the Azure Blob Storage.<br/>- `Oss`: the data is stored in the Aliyun OSS. |
| `storage.cache_path` | String | Unset | Read cache configuration for object storage such as 'S3' etc, it's configured by default when using object storage. It is recommended to configure it when using object storage for better performance.<br/>A local file directory, defaults to `{data_home}`. An empty string means disabling. |
| `storage.enable_read_cache` | Bool | `true` | Whether to enable read cache. If not set, the read cache will be enabled by default when using object storage. |
| `storage.cache_capacity` | String | Unset | The local file cache capacity in bytes. If your disk space is sufficient, it is recommended to set it larger. |
| `storage.bucket` | String | Unset | The S3 bucket name.<br/>**It's only used when the storage type is `S3`, `Oss` and `Gcs`**. |
| `storage.root` | String | Unset | The S3 data will be stored in the specified prefix, for example, `s3://${bucket}/${root}`.<br/>**It's only used when the storage type is `S3`, `Oss` and `Azblob`**. |
| `region_engine.mito.worker_channel_size` | Integer | `128` | Request channel size of each worker. |
| `region_engine.mito.worker_request_batch_size` | Integer | `64` | Max batch size for a worker to handle requests. |
| `region_engine.mito.manifest_checkpoint_distance` | Integer | `10` | Number of meta action updated to trigger a new checkpoint for the manifest. |
| `region_engine.mito.experimental_manifest_keep_removed_file_count` | Integer | `256` | Number of removed files to keep in the manifest's `removed_files` field before<br/>also removing them from `removed_files`. Mostly for debugging purposes.<br/>If set to 0, only `keep_removed_file_ttl` decides when to remove files<br/>from the `removed_files` field. |
| `region_engine.mito.experimental_manifest_keep_removed_file_ttl` | String | `1h` | How long to keep removed files in the `removed_files` field of the manifest<br/>after they are removed from the manifest.<br/>Files will only be removed from the `removed_files` field<br/>when both `keep_removed_file_count` and `keep_removed_file_ttl` are reached. |
| `region_engine.mito.compress_manifest` | Bool | `false` | Whether to compress manifest and checkpoint file by gzip (default false). |
| `region_engine.mito.max_background_flushes` | Integer | Auto | Max number of running background flush jobs (default: 1/2 of cpu cores). |
| `region_engine.mito.max_background_compactions` | Integer | Auto | Max number of running background compaction jobs (default: 1/4 of cpu cores). |
| `region_engine.mito.write_cache_path` | String | `""` | File system path for write cache, defaults to `{data_home}`. |
| `region_engine.mito.write_cache_size` | String | `5GiB` | Capacity for write cache. If your disk space is sufficient, it is recommended to set it larger. |
| `region_engine.mito.preload_index_cache` | Bool | `true` | Preload index (puffin) files into cache on region open (default: true).<br/>When enabled, index files are loaded into the write cache during region initialization,<br/>which can improve query performance at the cost of longer startup times. |
| `region_engine.mito.index_cache_percent` | Integer | `20` | Percentage of write cache capacity allocated for index (puffin) files (default: 20).<br/>The remaining capacity is used for data (parquet) files.<br/>Must be between 0 and 100 (exclusive). For example, with a 5GiB write cache and 20% allocation,<br/>1GiB is reserved for index files and 4GiB for data files. |
| `region_engine.mito.parallel_scan_channel_size` | Integer | `32` | Capacity of the channel to send data from parallel scan tasks to the main task. |
| `region_engine.mito.max_concurrent_scan_files` | Integer | `384` | Maximum number of SST files to scan concurrently. |
| `region_engine.mito.allow_stale_entries` | Bool | `false` | Whether to allow stale WAL entries read during replay. |
| `region_engine.mito.scan_memory_limit` | String | `50%` | Memory limit for table scans across all queries.<br/>Supports absolute size (e.g., "2GB") or percentage of system memory (e.g., "20%").<br/>Setting it to 0 disables the limit.<br/>NOTE: Works with max_concurrent_queries for tiered memory allocation.<br/>- If max_concurrent_queries is set: 70% of queries get full access, 30% get 70% access.<br/>- If max_concurrent_queries is 0 (unlimited): first 20 queries get full access, rest get 70% access. |
| `region_engine.mito.min_compaction_interval` | String | `0m` | Minimum time interval between two compactions.<br/>To align with the old behavior, the default value is 0 (no restrictions). |
| `region_engine.mito.default_experimental_flat_format` | Bool | `false` | Whether to enable experimental flat format as the default format. |
| `region_engine.mito.index` | -- | -- | The options for index in Mito engine. |
| `region_engine.mito.index.aux_path` | String | `""` | Auxiliary directory path for the index in filesystem, used to store intermediate files for<br/>creating the index and staging files for searching the index, defaults to `{data_home}/index_intermediate`.<br/>The default name for this directory is `index_intermediate` for backward compatibility.<br/><br/>This path contains two subdirectories:<br/>- `__intm`: for storing intermediate files used during creating index.<br/>- `staging`: for storing staging files used during searching index. |
| `region_engine.mito.index.staging_size` | String | `2GB` | The max capacity of the staging directory. |
| `region_engine.mito.memtable.fork_dictionary_bytes` | String | `1GiB` | Max dictionary bytes.<br/>Only available for `partition_tree` memtable. |
| `logging.append_stdout` | Bool | `true` | Whether to append logs to stdout. |
| `logging.log_format` | String | `text` | The log format. Can be `text`/`json`. |
| `logging.max_log_files` | Integer | `720` | The maximum amount of log files. |
| `logging.otlp_export_protocol` | String | `http` | The OTLP tracing export protocol. Can be `grpc`/`http`. |
| `logging.otlp_headers` | -- | -- | Additional OTLP headers, only valid when using OTLP http. |
| `logging.tracing_sample_ratio` | -- | Unset | The percentage of tracing that will be sampled and exported.<br/>Valid range `[0, 1]`: 1 means all traces are sampled, 0 means no traces are sampled. The default value is 1.<br/>Ratios > 1 are treated as 1. Fractions < 0 are treated as 0. |
| `export_metrics` | -- | -- | The datanode can export its metrics and send them to a Prometheus-compatible service (e.g. `greptimedb` itself) via the remote-write API.<br/>This is only used for `greptimedb` to export its own metrics internally. It's different from prometheus scrape. |
| `export_metrics.write_interval` | String | `30s` | The interval of export metrics. |
| `export_metrics.remote_write` | -- | -- | -- |
| `export_metrics.remote_write.url` | String | `""` | The prometheus remote write endpoint that the metrics send to. The url example can be: `http://127.0.0.1:4000/v1/prometheus/write?db=greptime_metrics`. |
| `memory.enable_heap_profiling` | Bool | `true` | Whether to enable heap profiling activation during startup.<br/>When enabled, heap profiling will be activated if the `MALLOC_CONF` environment variable<br/>is set to "prof:true,prof_active:false". The official image adds this env variable.<br/>Default is true. |
### Flownode
| `node_id` | Integer | Unset | The flownode identifier and should be unique in the cluster. |
| `flow` | -- | -- | flow engine options. |
| `flow.num_workers` | Integer | `0` | The number of flow workers in the flownode.<br/>If unset (or set to 0), the number of CPU cores divided by 2 is used. |
| `flow.batching_mode` | -- | -- | -- |
| `flow.batching_mode.query_timeout` | String | `600s` | The default batching engine query timeout is 10 minutes. |
| `flow.batching_mode.slow_query_threshold` | String | `60s` | Outputs a warn log for any query that runs longer than this threshold. |
| `flow.batching_mode.experimental_min_refresh_duration` | String | `5s` | The minimum duration between two query executions by a batching mode task. |
| `flow.batching_mode.experimental_grpc_max_retries` | Integer | `3` | The gRPC max retry number |
| `flow.batching_mode.experimental_frontend_scan_timeout` | String | `30s` | Timeout for the flow to wait for an available frontend.<br/>If no available frontend is found after `frontend_scan_timeout` elapses, an error is returned,<br/>which prevents the flownode from starting. |
| `flow.batching_mode.experimental_frontend_activity_timeout` | String | `60s` | Frontend activity timeout.<br/>If a frontend is down (not sending heartbeats) for more than `frontend_activity_timeout`,<br/>it is removed from the list of frontends the flownode connects to. |
| `flow.batching_mode.experimental_max_filter_num_per_query` | Integer | `20` | Maximum number of filters allowed in a single query |
| `logging.append_stdout` | Bool | `true` | Whether to append logs to stdout. |
| `logging.log_format` | String | `text` | The log format. Can be `text`/`json`. |
| `logging.max_log_files` | Integer | `720` | The maximum amount of log files. |
| `logging.otlp_export_protocol` | String | `http` | The OTLP tracing export protocol. Can be `grpc`/`http`. |
| `logging.otlp_headers` | -- | -- | Additional OTLP headers, only valid when using OTLP http. |
| `logging.tracing_sample_ratio` | -- | Unset | The percentage of tracing that will be sampled and exported.<br/>Valid range `[0, 1]`: 1 means all traces are sampled, 0 means no traces are sampled. The default value is 1.<br/>Ratios > 1 are treated as 1. Fractions < 0 are treated as 0. |
| `query.parallelism` | Integer | `1` | Parallelism of the query engine for queries sent by the flownode.<br/>Default to 1, so it won't use too much cpu or memory. |
| `query.memory_pool_size` | String | `50%` | Memory pool size for query execution operators (aggregation, sorting, join).<br/>Supports absolute size (e.g., "1GB", "2GB") or percentage of system memory (e.g., "20%").<br/>Setting it to 0 disables the limit (unbounded, default behavior).<br/>When this limit is reached, queries will fail with ResourceExhausted error.<br/>NOTE: This does NOT limit memory used by table scans. |
| `memory` | -- | -- | The memory options. |
| `memory.enable_heap_profiling` | Bool | `true` | Whether to enable heap profiling activation during startup.<br/>When enabled, heap profiling will be activated if the `MALLOC_CONF` environment variable<br/>is set to "prof:true,prof_active:false". The official image adds this env variable.<br/>Default is true. |
## Whether to use sparse primary key encoding.
sparse_primary_key_encoding = true
## The logging options.
[logging]
enable_otlp_tracing = false
## The OTLP tracing endpoint.
otlp_endpoint = "http://localhost:4318/v1/traces"
## Whether to append logs to stdout.
append_stdout = true
## The OTLP tracing export protocol. Can be `grpc`/`http`.
otlp_export_protocol="http"
## Additional OTLP headers, only valid when using OTLP http
[logging.otlp_headers]
## @toml2docs:none-default
#Authorization = "Bearer my-token"
## @toml2docs:none-default
#Database = "My database"
## The percentage of tracing will be sampled and exported.
## Valid range `[0, 1]`, 1 means all traces are sampled, 0 means all traces are not sampled, the default value is 1.
## ratio > 1 are treated as 1. Fractions < 0 are treated as 0
[logging.tracing_sample_ratio]
default_ratio=1.0
## The datanode can export its metrics and send to Prometheus compatible service (e.g. `greptimedb` itself) from remote-write API.
## This is only used for `greptimedb` to export its own metrics internally. It's different from prometheus scrape.
[export_metrics]
## whether enable export metrics.
enable=false
## The interval of export metrics.
write_interval="30s"
[export_metrics.remote_write]
## The prometheus remote write endpoint that the metrics send to. The url example can be: `http://127.0.0.1:4000/v1/prometheus/write?db=greptime_metrics`.
url=""
## HTTP headers of Prometheus remote-write carry.
headers={}
## The tracing options. Only effective when compiled with the `tokio-console` feature.
#+ [tracing]
## The tokio console address.
## @toml2docs:none-default
#+ tokio_console_addr = "127.0.0.1"
## The memory options.
[memory]
## Whether to enable heap profiling activation during startup.
## When enabled, heap profiling will be activated if the `MALLOC_CONF` environment variable
## is set to "prof:true,prof_active:false". The official image adds this env variable.
## Default is true.
enable_heap_profiling = true
## The default column prefix for auto-created time index and value columns.
## @toml2docs:none-default
default_column_prefix="greptime"
## The maximum in-flight write bytes.
## @toml2docs:none-default
#+ max_in_flight_write_bytes = "500MB"
## The following units are supported: `B`, `KB`, `KiB`, `MB`, `MiB`, `GB`, `GiB`, `TB`, `TiB`, `PB`, `PiB`.
## Set to 0 to disable limit.
body_limit="64MB"
## Maximum total memory for all concurrent HTTP request bodies.
## Set to 0 to disable the limit. Default: "0" (unlimited)
## @toml2docs:none-default
#+ max_total_body_memory = "1GB"
## HTTP CORS support, it's turned on by default
## This allows browser to access http APIs without CORS restrictions
enable_cors = true
server_addr="127.0.0.1:4001"
## The number of server worker threads.
runtime_size=8
## Maximum total memory for all concurrent gRPC request messages.
## Set to 0 to disable the limit. Default: "0" (unlimited)
## @toml2docs:none-default
#+ max_total_message_memory = "1GB"
## Compression mode for frontend side Arrow IPC service. Available options:
## - `none`: disable all compression
## - `transport`: only enable gRPC transport compression (zstd)
## - `all`: enable all compression.
## Default to `none`
flight_compression="arrow_ipc"
## The maximum connection age for gRPC connection.
## The value can be a human-readable time string. For example: `10m` for ten minutes or `1h` for one hour.
## Refer to https://grpc.io/docs/guides/keepalive/ for more details.
## @toml2docs:none-default
#+ max_connection_age = "10m"
## gRPC server TLS options, see `mysql.tls` section.
[grpc.tls]
## For now, gRPC tls config does not support auto reload.
watch = false
## The internal gRPC server options. Internal gRPC port for nodes inside the cluster to access the frontend.
[internal_grpc]
## The address to bind the gRPC server.
bind_addr = "127.0.0.1:4010"
## The address advertised to the metasrv, and used for connections from outside the host.
## If left empty or unset, the server will automatically use the IP address of the first network interface
## on the host, with the same port number as the one specified in `grpc.bind_addr`.
server_addr = "127.0.0.1:4010"
## The number of server worker threads.
runtime_size = 8
## Compression mode for frontend side Arrow IPC service. Available options:
## - `none`: disable all compression
## - `transport`: only enable gRPC transport compression (zstd)
## - `arrow_ipc`: only enable Arrow IPC compression (lz4)
## - `all`: enable all compression.
## Default to `none`
flight_compression = "arrow_ipc"
## Internal gRPC server TLS options, see the `mysql.tls` section.
[internal_grpc.tls]
## TLS mode.
mode = "disable"
## Certificate file path.
## @toml2docs:none-default
cert_path = ""
## Private key file path.
## @toml2docs:none-default
key_path = ""
## Watch for certificate and key file changes and auto reload.
## For now, the gRPC TLS config does not support auto reload.
watch = false
## MySQL server options.
[mysql]
## Whether to enable.
## Server-side keep-alive time.
## Set to 0 (default) to disable.
keep_alive = "0s"
## Maximum entries in the MySQL prepared statement cache; default is 10,000.
prepared_stmt_cache_size = 10000
# MySQL server TLS options.
[mysql.tls]
## Parallelism of the query engine.
## Default to 0, which means the number of CPU cores.
parallelism = 0
## Whether to allow query fallback when push-down optimization fails.
## Default to false, meaning that when push-down optimization fails, an error message is returned.
allow_query_fallback = false
## Memory pool size for query execution operators (aggregation, sorting, join).
## Supports absolute size (e.g., "4GB", "8GB") or percentage of system memory (e.g., "30%").
## Setting it to 0 disables the limit (unbounded, default behavior).
## When this limit is reached, queries will fail with ResourceExhausted error.
## NOTE: This does NOT limit memory used by table scans (only applies to datanodes).
memory_pool_size = "50%"
## Datanode options.
[datanode]
enable_otlp_tracing = false
## The OTLP tracing endpoint.
otlp_endpoint = "http://localhost:4318/v1/traces"
## Whether to append logs to stdout.
append_stdout = true
## The OTLP tracing export protocol. Can be `grpc`/`http`.
otlp_export_protocol="http"
## Additional OTLP headers, only valid when using OTLP http
[logging.otlp_headers]
## @toml2docs:none-default
#Authorization = "Bearer my-token"
## @toml2docs:none-default
#Database = "My database"
## The percentage of tracing will be sampled and exported.
## Valid range `[0, 1]`, 1 means all traces are sampled, 0 means all traces are not sampled, the default value is 1.
## ratio > 1 are treated as 1. Fractions < 0 are treated as 0
## The sampling ratio of slow query log. The value should be in the range of (0, 1]. For example, `0.1` means 10% of the slow queries will be logged and `1.0` means all slow queries will be logged.
sample_ratio = 1.0
## The TTL of the `slow_queries` system table. Default is `90d` when `record_type` is `system_table`.
ttl = "90d"
## The frontend can export its metrics and send them to a Prometheus-compatible service (e.g. `greptimedb` itself) via the remote-write API.
## This is only used for `greptimedb` to export its own metrics internally. It's different from prometheus scrape.
[export_metrics]
## Whether to enable export metrics.
enable = false
## The interval of export metrics.
write_interval = "30s"
[export_metrics.remote_write]
## The prometheus remote write endpoint that the metrics send to. The url example can be: `http://127.0.0.1:4000/v1/prometheus/write?db=greptime_metrics`.
url = ""
## HTTP headers to carry in Prometheus remote-write requests.
headers = {}
## The tracing options. Only effective when compiled with the `tokio-console` feature.
#+ [tracing]
## The tokio console address.
## @toml2docs:none-default
#+ tokio_console_addr = "127.0.0.1"
## The memory options.
[memory]
## Whether to enable heap profiling activation during startup.
## When enabled, heap profiling will be activated if the `MALLOC_CONF` environment variable
## is set to "prof:true,prof_active:false". The official image adds this env variable.
## Default is true.
enable_heap_profiling = true
## Configuration options for the event recorder.
[event_recorder]
## TTL for the events table that will be used to store the events. Default is `90d`.
ttl = "90d"
## **It's only used when the provider is `kafka`**.
topic_name_prefix = "greptimedb_wal_topic"
## Expected number of replicas of each partition.
## **It's only used when the provider is `kafka`**.
replication_factor = 1
## The timeout for creating a Kafka topic.
## **It's only used when the provider is `kafka`**.
create_topic_timeout = "30s"
# The Kafka SASL configuration.
# client_cert_path = "/path/to/client_cert"
# client_key_path = "/path/to/key"
## Configuration options for the event recorder.
[event_recorder]
## TTL for the events table that will be used to store the events. Default is `90d`.
ttl="90d"
## Configuration options for the stats persistence.
[stats_persistence]
## TTL for the stats table that will be used to store the stats.
## Set to `0s` to disable stats persistence.
## Default is `0s`.
## If you want to enable stats persistence, set the TTL to a value greater than 0.
## It is recommended to set a small value, e.g., `3h`.
ttl="0s"
## The interval to persist the stats. Default is `10m`.
## The minimum value is `10m`, if the value is less than `10m`, it will be overridden to `10m`.
interval="10m"
## The logging options.
[logging]
## The directory to store the log files. If set to empty, logs will not be written to files.
enable_otlp_tracing = false
## The OTLP tracing endpoint.
otlp_endpoint = "http://localhost:4318/v1/traces"
## Whether to append logs to stdout.
append_stdout = true
## The OTLP tracing export protocol. Can be `grpc`/`http`.
otlp_export_protocol="http"
## Additional OTLP headers, only valid when using OTLP http
[logging.otlp_headers]
## @toml2docs:none-default
#Authorization = "Bearer my-token"
## @toml2docs:none-default
#Database = "My database"
## The percentage of tracing will be sampled and exported.
## Valid range `[0, 1]`, 1 means all traces are sampled, 0 means all traces are not sampled, the default value is 1.
## ratio > 1 are treated as 1. Fractions < 0 are treated as 0
[logging.tracing_sample_ratio]
default_ratio = 1.0
## The metasrv can export its metrics and send them to a Prometheus-compatible service (e.g. `greptimedb` itself) via the remote-write API.
## This is only used for `greptimedb` to export its own metrics internally. It's different from prometheus scrape.
[export_metrics]
## Whether to enable export metrics.
enable = false
## The interval of export metrics.
write_interval = "30s"
[export_metrics.remote_write]
## The prometheus remote write endpoint that the metrics send to. The url example can be: `http://127.0.0.1:4000/v1/prometheus/write?db=greptime_metrics`.
url = ""
## HTTP headers to carry in Prometheus remote-write requests.
headers = {}
## The tracing options. Only effective when compiled with the `tokio-console` feature.
#+ [tracing]
## The tokio console address.
## @toml2docs:none-default
#+ tokio_console_addr = "127.0.0.1"
## The memory options.
[memory]
## Whether to enable heap profiling activation during startup.
## When enabled, heap profiling will be activated if the `MALLOC_CONF` environment variable
## is set to "prof:true,prof_active:false". The official image adds this env variable.
## Default is true.
enable_heap_profiling = true
## Whether to enable read cache. If not set, the read cache will be enabled by default when using object storage.
#+ enable_read_cache = true
## Read cache configuration for object storage such as 'S3' etc, it's configured by default when using object storage. It is recommended to configure it when using object storage for better performance.
## A local file directory, defaults to `{data_home}`. An empty string means disabling.
## @toml2docs:none-default
## @toml2docs:none-default
write_cache_ttl="8h"
## Preload index (puffin) files into cache on region open (default: true).
## When enabled, index files are loaded into the write cache during region initialization,
## which can improve query performance at the cost of longer startup times.
preload_index_cache = true
## Percentage of write cache capacity allocated for index (puffin) files (default: 20).
## The remaining capacity is used for data (parquet) files.
## Must be between 0 and 100 (exclusive). For example, with a 5GiB write cache and 20% allocation,
## 1GiB is reserved for index files and 4GiB for data files.
index_cache_percent = 20
## Buffer size for SST writing.
sst_write_buffer_size = "8MB"
## Capacity of the channel to send data from parallel scan tasks to the main task.
parallel_scan_channel_size = 32
## Maximum number of SST files to scan concurrently.
max_concurrent_scan_files = 384
## Whether to allow stale WAL entries read during replay.
allow_stale_entries = false
## Memory limit for table scans across all queries.
## Supports absolute size (e.g., "2GB") or percentage of system memory (e.g., "20%").
## Setting it to 0 disables the limit.
## NOTE: Works with max_concurrent_queries for tiered memory allocation.
## - If max_concurrent_queries is set: 70% of queries get full access, 30% get 70% access.
## - If max_concurrent_queries is 0 (unlimited): first 20 queries get full access, rest get 70% access.
scan_memory_limit="50%"
## Minimum time interval between two compactions.
## To align with the old behavior, the default value is 0 (no restrictions).
min_compaction_interval = "0m"
## Whether to enable experimental flat format as the default format.
default_experimental_flat_format = false
## Whether to use sparse primary key encoding.
sparse_primary_key_encoding = true
## The logging options.
[logging]
enable_otlp_tracing = false
## The OTLP tracing endpoint.
otlp_endpoint = "http://localhost:4318/v1/traces"
## Whether to append logs to stdout.
append_stdout = true
## The OTLP tracing export protocol. Can be `grpc`/`http`.
otlp_export_protocol="http"
## Additional OTLP headers, only valid when using OTLP http
[logging.otlp_headers]
## @toml2docs:none-default
#Authorization = "Bearer my-token"
## @toml2docs:none-default
#Database = "My database"
## The percentage of tracing will be sampled and exported.
## Valid range `[0, 1]`, 1 means all traces are sampled, 0 means all traces are not sampled, the default value is 1.
## ratio > 1 are treated as 1. Fractions < 0 are treated as 0
## @toml2docs:none-default
#+ sample_ratio = 1.0
## The standalone can export its metrics and send them to a Prometheus-compatible service (e.g. `greptimedb`) via the remote-write API.
## This is only used for `greptimedb` to export its own metrics internally. It's different from prometheus scrape.
[export_metrics]
## Whether to enable export metrics.
enable = false
## The interval of export metrics.
write_interval = "30s"
## For `standalone` mode, `self_import` is recommended to collect metrics generated by itself.
## You must create the database before enabling it.
[export_metrics.self_import]
## @toml2docs:none-default
db = "greptime_metrics"
[export_metrics.remote_write]
## The prometheus remote write endpoint that the metrics send to. The url example can be: `http://127.0.0.1:4000/v1/prometheus/write?db=greptime_metrics`.
url = ""
## HTTP headers to carry in Prometheus remote-write requests.
headers = {}
## The tracing options. Only effective when compiled with the `tokio-console` feature.
#+ [tracing]
## The tokio console address.
## @toml2docs:none-default
#+ tokio_console_addr = "127.0.0.1"
## The memory options.
[memory]
## Whether to enable heap profiling activation during startup.
## When enabled, heap profiling will be activated if the `MALLOC_CONF` environment variable
## is set to "prof:true,prof_active:false". The official image adds this env variable.
## Default is true.
enable_heap_profiling = true
The data is a string in the format `global_level,module1=level1,module2=level2,...`, following the same rules as `RUST_LOG`.
The module is the module name of the log, and the level is the log level. The log level can be one of the following: `trace`, `debug`, `info`, `warn`, `error`, `off` (case insensitive).
Currently, our query engine is based on DataFusion, so all aggregate functions are executed by DataFusion through its UDAF interface. You can find DataFusion's UDAF example [here](https://github.com/apache/arrow-datafusion/blob/arrow2/datafusion-examples/examples/simple_udaf.rs). Basically, we provide the same way as DataFusion to write aggregate functions: both are centered on a struct called an "Accumulator" that accumulates state along the way during aggregation.
However, DataFusion's UDAF implementation has a big restriction: it requires the user to provide a concrete "Accumulator". Take the `Median` aggregate function for example: to aggregate a `u32` column, you have to write a `MedianU32` and use `SELECT MEDIANU32(x)` in SQL, and `MedianU32` cannot be used to aggregate an `i32` column. Alternatively, you can use a special type that can hold all kinds of data (like our `Value` enum or Arrow's `ScalarValue`) and `match` all the way through the aggregate calculations. That works, though it is rather tedious. (But I think it's DataFusion's preferred way to write UDAFs.)
So is there a way to make an aggregate function that automatically matches the input data's type, for example, a `Median` aggregator that works on both `u32` and `i32` columns? The answer is yes, once we find a way to bypass DataFusion's restriction: DataFusion simply doesn't pass the input data's type when creating an Accumulator.
> There's an example in `my_sum_udaf_example.rs`; take that as a quick start.
# 1. Impl `AggregateFunctionCreator` trait for your accumulator creator.
You must first define a struct that will be used to create your accumulator. For example,
```Rust
#[as_aggr_func_creator]
#[derive(Debug, AggrFuncTypeStore)]
struct MySumAccumulatorCreator {}
```
Attribute macro `#[as_aggr_func_creator]` and derive macro `#[derive(Debug, AggrFuncTypeStore)]` must both be annotated on the struct. They work together to provide storage for the aggregate function's input data types, which is needed for creating the generic accumulator later.
> Note that the `as_aggr_func_creator` macro adds fields to the struct, so the struct can neither be defined as a fieldless struct like `struct Foo;`, nor as a tuple struct like `struct Foo(Bar)`.
Then impl the `AggregateFunctionCreator` trait on it. A sketch of the trait's shape follows.
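The trait body isn't quoted in this document; the following is a reconstruction from the description above and below, so the exact signatures in the codebase may differ:

```Rust
// A shape sketch, not the verbatim trait: `AggrFuncTypeStore` supplies the
// stored input types, and the other names come from the surrounding text.
pub trait AggregateFunctionCreator: AggrFuncTypeStore {
    /// Returns a function that builds the accumulator from the input types.
    fn creator(&self) -> AccumulatorCreatorFunction;
    /// The aggregate's output type, derivable via `input_types()`.
    fn output_type(&self) -> Result<ConcreteDataType>;
    /// The types of the accumulator's internal states.
    fn state_types(&self) -> Result<Vec<ConcreteDataType>>;
}
```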
You can use the input data's types in the methods that return the output type and state types (just invoke `input_types()`).
The output type is the aggregate function's output data type. For example, the `SUM` aggregate function's output type is `u64` for a `u32` column. The state types are the accumulator's internal states' types. Take the `AVG` aggregate function on an `i32` column as an example: its state types are `i64` (for the sum) and `u64` (for the count).
The `creator` function is where you define how an accumulator (that will be used in DataFusion) is created. You define "how" to create the accumulator (instead of "what" to create), using the input data's types as arguments. With the input datatype known, you can create the accumulator generically, as sketched below.
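For illustration, a hedged sketch of such a generic creator; `MySum<T>` and `unsupported_type_error` are hypothetical names, and the exact closure signature is an assumption:

```Rust
// Hypothetical sketch: dispatch on the known input type to build a typed
// accumulator, instead of hard-coding one aggregate function per type.
fn creator(&self) -> AccumulatorCreatorFunction {
    Arc::new(|input_types: &[ConcreteDataType]| match input_types[0] {
        ConcreteDataType::Int32(_) => Ok(Box::new(MySum::<i32>::default())),
        ConcreteDataType::UInt32(_) => Ok(Box::new(MySum::<u32>::default())),
        // Every other type is rejected; `unsupported_type_error` is illustrative.
        _ => Err(unsupported_type_error()),
    })
}
```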
# 2. Impl `Accumulator` trait for your accumulator.
The accumulator is where you store the aggregate calculation states and evaluate a result. You must impl the `Accumulator` trait for it; a sketch of the trait's shape follows.
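Again, this is reconstructed from the step descriptions below rather than quoted, so the exact signatures may differ:

```Rust
// A shape sketch, not the verbatim trait.
pub trait Accumulator: Send + Sync + std::fmt::Debug {
    /// The internal states so far, i.e. the intermediate calculation result.
    fn state(&self) -> Result<Vec<Value>>;
    /// Updates the states with a batch of partitioned input data.
    fn update_batch(&mut self, values: &[VectorRef]) -> Result<()>;
    /// Merges other accumulators' states (their `state()` output) into this one.
    fn merge_batch(&mut self, states: &[VectorRef]) -> Result<()>;
    /// Evaluates the final aggregate result.
    fn evaluate(&self) -> Result<Value>;
}
```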
DataFusion basically executes an aggregate like this:
1. Partition all input data for the aggregate, and create an accumulator for each part.
2. Call `update_batch` on each accumulator with its partitioned data, to let you update your aggregate calculation.
3. Call `state` to get each accumulator's internal state, the intermediate calculation result.
4. Call `merge_batch` to merge all accumulators' internal states into one.
5. Execute `evaluate` on the chosen one to get the final calculation result.
Once you know the meaning of each method, you can easily write your accumulator. You can refer to `Median` accumulator or `SUM` accumulator defined in file `my_sum_udaf_example.rs` for more details.
# 3. Register your aggregate function to our query engine.
You can call the `register_aggregate_function` method in the query engine to register your aggregate function. To do that, you have to create an instance of the struct `AggregateFunctionMeta`. The struct has three fields. The first is your aggregate function's name. The function name is case-sensitive due to DataFusion's restriction, and we strongly recommend using a lowercase name. If you have to use an uppercase name, wrap your aggregate function in quotation marks. For example, if you define an aggregate function named "my_aggr", you can use "`SELECT MY_AGGR(x)`"; if you define "my_AGGR", you have to use "`SELECT "my_AGGR"(x)`".
The second field is `arg_counts`, the count of the arguments. Take the `percentile` accumulator, which calculates the p-number of a column: we need both the column's values and the value of `p` as inputs, so the argument count is two.
The third field is a function that creates the accumulator creator you defined in step 1 above. Creating a creator is a bit convoluted, but it is how we make DataFusion use a newly created aggregate function each time it executes a SQL statement, preventing the stored input types from affecting each other. A good starting point for the details is our `DfContextProviderAdapter` struct's `get_aggregate_meta` method. A rough registration sketch follows.
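Putting the three fields together, a registration call might look roughly like this; the exact `AggregateFunctionMeta::new` signature and the `MyAggrAccumulatorCreator` name are assumptions for illustration:

```Rust
// A sketch only; the constructor's shape may differ from the actual codebase.
query_engine.register_aggregate_function(Arc::new(AggregateFunctionMeta::new(
    "my_aggr", // the function name; lowercase is strongly recommended
    2,         // the argument count, e.g. `percentile` takes the column and `p`
    // Create a fresh creator per SQL execution so stored input types
    // from concurrent queries don't affect each other.
    Arc::new(|| Arc::new(MyAggrAccumulatorCreator::default())),
)));
```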
# (Optional) 4. Make your aggregate function automatically registered.
If you've written a great aggregate function that you want everyone to use, you can have it automatically registered in our query engine at start time. It's quick and simple; just refer to the `AggregateFunctions::register` function in `common/function/src/scalars/aggregate/mod.rs`.
The most suitable compaction strategy for the time-series scenario would be
a hybrid strategy that combines time window compaction with size-tiered compaction, just like [Cassandra](https://cassandra.apache.org/doc/latest/cassandra/managing/operating/compaction/twcs.html) and [ScyllaDB](https://docs.scylladb.com/stable/architecture/compaction/compaction-strategies.html#time-window-compaction-strategy-twcs) do.
We can first group SSTs in level n into buckets according to some predefined time window. Within that window,
SSTs are compacted in a size-tiered manner (find SSTs with similar size and compact them to level n+1).
- Greptime Flow is built on top of [Hydroflow](https://github.com/hydro-project/hydroflow).
- We have three choices for the Dataflow/Streaming process framework for our simple continuous aggregation feature:
1. Based on the timely/differential dataflow crates that [materialize](https://github.com/MaterializeInc/materialize) is based on. This later proved too obscure for a simple use case, and hard to customize for memory usage control.
2. Based on a simple dataflow framework that we write from the ground up, like what [arroyo](https://www.arroyo.dev/) or [risingwave](https://www.risingwave.dev/) did; for example, the core streaming logic of [arroyo](https://github.com/ArroyoSystems/arroyo/blob/master/crates/arroyo-datastream/src/lib.rs) only takes up to 2,000 lines of code. However, it means maintaining another layer of dataflow framework, which might seem easy in the beginning, but I fear it might be too burdensome to maintain once we need more features.
3. Based on a simple, lower-level dataflow framework that someone else wrote, like [hydroflow](https://github.com/hydro-project/hydroflow). This approach combines the best of both worlds: firstly, it boasts ease of comprehension and customization; secondly, the dataflow framework offers precisely the features necessary for crafting uncomplicated single-node dataflow programs while delivering decent performance.
Hence, we chose the third option and use a simple logical plan that's agnostic to the underlying dataflow framework, as it only describes what the dataflow graph should be doing, not how it does it. We built operators in hydroflow to execute the plan, and the resulting hydroflow graph is wrapped in an engine that only supports data in/out and a tick event to flush and compute the result. This provides a thin middle layer that's easy to maintain and allows switching to another dataflow framework if necessary.
This phase is for static analysis of the new partition rule. The server can know whether the repartitioning is possible, how to do the repartitioning, and how much resources are needed.
In theory, the input and output partition rules for repartitioning can be completely unrelated. But in practice, to avoid a very large change set, we'll only allow two simple kinds of change. One splits one region into two regions (region split) and another merges two regions into one (region merge).
After validating the new partition rule using the same validation logic as table creation, we compute the difference between the old and new partition rules. The resulting diff may contain several independent groups of changes. During subsequent processing, each group of changes can be handled independently and can succeed or fail without affecting other groups or creating scenarios that cannot be retried idempotently.
Next, we generate a repartition plan for each group of changes. Each plan contains the information for all regions involved in that particular plan, and one target region will only be referenced by a single plan.
With those plans, we can determine the resource requirements for the repartition operation, where resources primarily refer to regions. Metasrv will coordinate with the PaaS layer to pre-allocate the necessary regions at this stage. These new regions start completely empty, and their metadata and manifests will be populated during subsequent modification steps.
## Data Processing
This phase is primarily for region's change, including region's metadata (route table and the corresponding rule) and manifest.
Once we start processing one plan through a procedure, we'll first stop the region's compaction and snapshot. This is to avoid any state being removed due to compaction (which may remove old SST files) or snapshots (which may remove old manifest files).
Metasrv will try to update the partition metadata, i.e., the region route table (related to `PartitionRuleManager`). This step is in the "no ingestion" scope, so no new data will be ingested. Since this won't take much time, the impact on the cluster is minimized. Metasrv will also push the updated region rule to the corresponding regions on Datanodes.
Every region and every ingestion request to the region server will carry a version of the region rule, identifying under which rule the request is processed. The version can be something like `hash(region_rule)`. Once the region rule on the region server is updated, all ingestion requests with the old rule will be rejected, and all requests with the new rule will be accepted but not visible. They can still be flushed to persistent storage, but their version change (new manifest) will be staged; the sketch below illustrates this version gating.
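A minimal self-contained sketch of the gating; all names here are illustrative, not the actual GreptimeDB types:

```Rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Illustrative struct; the real rule is a structured partition expression set.
#[derive(Hash)]
struct RegionRule {
    exprs: Vec<String>, // partition expressions, e.g. "device_id < 100"
}

/// The rule version can simply be a hash of the rule. (A real implementation
/// would use a hash that is stable across processes.)
fn rule_version(rule: &RegionRule) -> u64 {
    let mut hasher = DefaultHasher::new();
    rule.hash(&mut hasher);
    hasher.finish()
}

enum Admission {
    /// The request was built against an outdated rule: reject it outright.
    Rejected,
    /// The request matches the current rule: accept it, but keep the resulting
    /// manifest change staged (invisible) until the procedure acknowledges it.
    AcceptedStaged,
}

fn admit(request_version: u64, current_version: u64) -> Admission {
    if request_version == current_version {
        Admission::AcceptedStaged
    } else {
        Admission::Rejected
    }
}

fn main() {
    let rule = RegionRule { exprs: vec!["device_id < 100".into(), "device_id >= 100".into()] };
    let current = rule_version(&rule);
    assert!(matches!(admit(current, current), Admission::AcceptedStaged));
    assert!(matches!(admit(current ^ 1, current), Admission::Rejected));
}
```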
Then region 0 (or any operational region picked by metasrv) will compute the new manifests for all target regions. This step is done by first reading all old manifests and remapping the files with the new partition rule to get the content of the new manifests. Notice this step only handles the manifests from before the region rule change on the region server, and won't touch the staged manifests, as they are already under the new rule.
Those new manifests will be submitted to the corresponding target regions by region 0 via a `RegionEdit` request. If this request fails after a few retries, region 0 will try to roll back the change by directly overwriting the manifest on object storage, report the failure to metasrv, and let the entire repartition procedure fail. We can also optionally compute the new manifests for the staged version changes (like another repartition) and submit them to the target regions to make them visible even if the repartition fails.
On the other hand, a successful `RegionEdit` request also acknowledges those staged version changes and makes them visible.
After this step, the repartition is done in the data plane. We can start to process compaction and snapshot again.
## Postprocessing
After the main processing is done, we can do some extra postprocessing to reduce the performance impact of repartition, including reloading caches such as the frontend's route table, metasrv's kv cache, and the datanode's read/write/page caches.
We can also schedule an optional compaction to reorganize all the data files under the new partition rule, to reduce potential fragmentation or read amplification.
## Procedure
Here is the repartition procedure, step by step:
- <on frontend> Validate the repartition request
- <on frontend> Initialize the repartition procedure
- Calculate rule diff and repartition plan group
- Allocate necessary new regions
- Lock the table key
- For each repartition subprocedure
- Stop compaction and snapshot
- Forbid new ingestion requests, update metadata, allow ingestion requests.
- Update region rule to regions
    - Pick one region to calculate the new manifests for all regions in this repartition group
    - Let that region apply the new manifest to each region via `RegionEdit`
    - If it fails after some retries, revert this manifest change on the regions that succeeded and mark this failure.
    - If all succeed, acknowledge those staged version changes and make them visible.
- Return result
- Collect results from subprocedure.
    - For those that failed, we need to restart their regions to force reconstructing their state from manifests
    - For those that succeeded, collect and merge their rule diffs
- Unlock the table key
- Report the result to user.
- <in background> Reload caches
- <in background> Maybe trigger a special compaction
In addition to the sequential steps, rollback is also an important part of this procedure. Three steps can be rolled back when an unrecoverable failure occurs.
If the metadata update is not committed, we can overwrite the metadata back to the previous version. This step is scoped within the "no ingestion" period, so no new data will be ingested and the status of both datanode and metasrv will stay consistent.
If the `RegionEdit` to other regions is not acknowledged, or only partially acknowledged, we can directly overwrite the manifest on object storage from the central region (which computes the new manifest), and force the region server to reload the corresponding regions so they recover their state from object storage.
If the staged version changes are not acknowledged, we can re-compute the manifests based on the old rule for the staged data and apply them directly as above. This is like another, smaller repartition of the staged data.
## Region rule validation and diff calculation
In the current codebase, the rule checker is not complete: it can't check the uniqueness and completeness of the rule. This RFC also proposes a new way to validate the rule.
The proposed validation is based on a check-point system, which first generates a group of check-points from the rule, and then checks that every point is covered by exactly one rule.
All partition rule expressions are limited to the form `<column> <operator> <value>`, where the operator must be a comparison operator. These expressions can be nested with `AND` and `OR` operators. Based on this, we can first extract all the unique values on each column, adding and subtracting a small epsilon to cover each value's left and right boundary.
Since we accept integer, float, and string value types, computing on them directly is not convenient. So we first normalize them to a common type that only needs to preserve the relative partial ordering. This also avoids the problems of "what is the next/previous value" for strings and "what is a good precision" for floats.
After normalization, we get a set of scatter points for each column. Then we can generate a set of check-points by combining all the scatter points, like building a Cartesian product. This might produce a large number of check-points, so we can apply a pruning optimization that removes some of them by merging expression zones: expressions that have identical sub-expressions on N-1 edges and one adjacent edge can be merged together. This pruning check has a time complexity of O(N * M * log(M)), where N is the number of active dimensions and M is the number of expression zones. Diff calculation is also done by finding differing expression zones between the old and new rule sets, and checking whether one can be transformed into the other by merging some of the expression zones.
The step that validates the check-point set against the expressions can be treated as a tiny `PhysicalExpr` evaluation. It yields a boolean matrix of shape K×M, where K is the number of check-points. We then check that each row of the matrix contains exactly one true value, as the sketch below illustrates.
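A minimal self-contained sketch of that final check, assuming the expressions have already been evaluated into the boolean matrix (names are illustrative):

```Rust
/// `matrix[k][m]` is true when check-point `k` satisfies partition expression `m`.
fn validate(matrix: &[Vec<bool>]) -> Result<(), String> {
    for (k, row) in matrix.iter().enumerate() {
        match row.iter().filter(|&&hit| hit).count() {
            0 => return Err(format!("check-point {k} is not covered by any rule")),
            1 => {} // covered by exactly one rule: OK
            n => return Err(format!("check-point {k} is covered by {n} overlapping rules")),
        }
    }
    Ok(())
}

fn main() {
    // Expressions `x < 10` and `x >= 10`, checked at points 10-ε, 10, and 10+ε.
    let matrix = vec![
        vec![true, false],  // 10-ε: only `x < 10`
        vec![false, true],  // 10:   only `x >= 10`
        vec![false, true],  // 10+ε: only `x >= 10`
    ];
    assert!(validate(&matrix).is_ok());
}
```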
## Compute and use new manifest
We can generate a new set of manifest files based on the old manifests and the two versions of the rule. From the rule processing above, we can tell how a new rule and region derive from the previous ones. So a simple way to get the new manifest is to apply the same change steps to the manifest files. E.g., if region A is derived from regions B and C, we simply combine all file IDs from B and C to generate the content of A.
If necessary, we can do this better by involving some metadata about the data, like the min-max statistics of each file, and pre-evaluating over the min-max values to filter out unneeded files when generating the new manifest. A sketch of the simple union-based approach follows.
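A minimal sketch with illustrative types; real manifests carry far more metadata, and min-max filtering could be applied where the files are collected:

```Rust
use std::collections::{BTreeMap, BTreeSet};

// Illustrative types only.
type RegionId = u32;
type FileId = String;
type Manifest = BTreeSet<FileId>;

/// Builds each target region's manifest by unioning the manifests of the
/// source regions it derives from (e.g. merging B and C into A).
fn compute_new_manifests(
    old: &BTreeMap<RegionId, Manifest>,
    mapping: &BTreeMap<RegionId, Vec<RegionId>>, // target -> source regions
) -> BTreeMap<RegionId, Manifest> {
    mapping
        .iter()
        .map(|(target, sources)| {
            let files: Manifest = sources
                .iter()
                .filter_map(|source| old.get(source))
                .flatten()
                .cloned()
                .collect();
            (*target, files)
        })
        .collect()
}

fn main() {
    let mut old = BTreeMap::new();
    old.insert(1, Manifest::from(["b1.parquet".to_string()]));
    old.insert(2, Manifest::from(["c1.parquet".to_string()]));
    // Region 3 (A) is merged from regions 1 (B) and 2 (C).
    let mapping = BTreeMap::from([(3, vec![1, 2])]);
    assert_eq!(compute_new_manifests(&old, &mapping)[&3].len(), 2);
}
```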
Using the new manifest needs one extra step on top of the current implementation. We'll need to record, either in the manifest or in the file metadata, which rule was in use when an SST file was generated (by flush or compaction). Then, in every single read request, we need to append the current region rule as a predicate, to ensure no data belonging to other regions is read. We can use the stored region rule to reduce the number of new predicates to apply, by removing the predicates that are identical between the current region rule and the stored region rule. So, ideally, in a table that has not been repartitioned recently, the overhead of checking the region rule is minimal.
## Pre-required tasks
In above steps, we assume some functionalities are implemented. Here list them with where they are used and how to implement them.
### Cross-region read
The current data directory structure is `{table_id}/{region_id}/[data/metadata]/{file_id}`, and each region can only access files under its own directory. After repartition, data files may be located under other, older regions' directories, so we need to support cross-region reads. This new access method allows a region to access any file under the same table. Related tracking issue is <https://github.com/GreptimeTeam/greptimedb/issues/6409>.
### Global GC worker
This is to simplify the state management of data files, as one file may be referenced by multiple manifests, or by no manifest at all. After this, every region and the repartition process only need to care about generating and using new files, without tracking whether a file should be deleted, leaving the deletion to the global GC worker. This worker basically works by counting references from manifest files and removing unused ones. Related tracking issue is **TBD**.
# Alternatives
In the "Data Processing" section, we can enlarge the "no ingestion" period to include almost all the steps. This can simplify the entire procedure by a lot, but will bring a longer time of ingestion pause which may not be acceptable.
This RFC proposes a compatibility test framework for GreptimeDB to ensure backward/forward compatibility for different versions of GreptimeDB.
# Motivation
In current practice, we don't have a systematic way to test and ensure the compatibility of different versions of GreptimeDB. Each time we release a new version, we need to manually test compatibility with ad-hoc cases. This is not only time-consuming, but also error-prone and unmaintainable, and it relies heavily on the release manager to ensure the compatibility of different versions of GreptimeDB.
We don't have a detailed guide in the release SOP on how to test and ensure the compatibility of a new version, and we have broken compatibility many times (`v0.14.1` and `v0.15.1` are two examples, both released right after a major release).
# Details
This RFC proposes a compatibility test framework that is easy to maintain, extend and run. It can tell the compatibility between any given two versions of GreptimeDB, both backward and forward. It's based on the Sqlness library but used in a different way.
Generally speaking, the framework is composed of two parts:
1. Test cases: A set of test cases that are maintained specifically for the compatibility test, still in the `.sql` and `.result` format.
2. Test framework: A new sqlness runner that is used to run the test cases, with some new features that are not required by the integration sqlness test.
## Test Cases
### Structure
The case set is organized in three parts:
- `1.feature`: Use a new feature
- `2.verify`: Verify database behavior
- `3.cleanup`: Paired with `1.feature`, clean up the test environment.
These three parts are organized in a tree structure, and should be run in sequence:
```
compatibility_test/
├── 1.feature/
│ ├── feature-a/
│ ├── feature-b/
│ └── feature-c/
├── 2.verify/
│ ├── verify-metadata/
│ ├── verify-data/
│ └── verify-schema/
└── 3.cleanup/
├── cleanup-a/
├── cleanup-b/
└── cleanup-c/
```
### Example
For example, for a new feature like adding new index option ([#6416](https://github.com/GreptimeTeam/greptimedb/pull/6416)), we (who implement the feature) create a new test case like this:
Since this new feature doesn't require a special way to verify the database behavior, we can reuse existing test cases in `2.verify/`. For example, we can reuse the `verify-metadata` test case to verify the metadata of the table.
In this example, we use some new sqlness features that will be introduced in the next section (`since`, `IGNORE_RESULT`, `TEMPLATE`).
### Maintenance
Each time we implement a new feature that should be covered by the compatibility test, we should create new test cases in `1.feature/` and `3.cleanup/` for it, and check whether existing cases in `2.verify/` can be reused to verify the database behavior.
This simulates an enthusiastic user who adopts all the new features right away. The added maintenance burden falls on the feature implementer: write one more test case for the new feature to pin down its behavior. Once there is a breaking change in the future, it can then be detected by the compatibility test framework automatically.
Another topic is deprecation. If a feature is deprecated, we should also mark it in the test case. Using the above example, assume we deprecate the `index.granularity` and `index.false_positive_rate` index options in `v0.99.0`; we can mark them as:
```sql
-- SQLNESS ARG since=0.15.0 till=0.99.0
...
```
This tells the framework to ignore this feature in version `v0.99.0` and later. Since we currently have many experimental features that are scheduled to change, this is a good way to mark them.
## Test Framework
This section is about new sqlness features required by this framework.
### Since and Till
Following the `ARG` interceptor in sqlness, we can mark a feature as available between two given versions. Only `since` is required, e.g. `-- SQLNESS ARG since=0.15.0`.
### IGNORE_RESULT
`IGNORE_RESULT` is a new interceptor; it tells the runner to ignore the result of the query and only check whether the query executed successfully.
This is useful to reduce the maintenance burden of the test cases: unlike the integration sqlness test, in most cases we don't care about the result of the query; we only need to make sure the query executes successfully.
### TEMPLATE
`TEMPLATE` is another new interceptor; it can generate queries from a template based on runtime data.
In the above example, we need to run the `SHOW CREATE TABLE` query for all existing tables, so we can use the `TEMPLATE` interceptor to generate the query with a dynamic table list.
### RUNNER
There are also some extra requirements for the runner itself:
- It should run the test cases in sequence, first `1.feature/`, then `2.verify/`, and finally `3.cleanup/`.
- It should be able to fetch required version automatically to finish the test.
- It should handle the `since` and `till` properly.
In the `1.feature` phase, the runner needs to identify, by version number, all features that need to be tested. It then restarts with a new version (the `to` version) to run the `2.verify/` and `3.cleanup/` phases.
## Test Report
Finally, we can run the compatibility test to verify the compatibility between any given two versions of GreptimeDB, for example:
```bash
# check backward compatibility between v0.15.0 and v0.16.0 when releasing v0.16.0
./sqlness run --from=0.15.0 --to=0.16.0
# check forward compatibility when downgrading from v0.15.0 to v0.13.0
./sqlness run --from=0.15.0 --to=0.13.0
```
We can also use a script to run the compatibility test over all versions in a given range, producing a quick report covering every version we need.
Since we always bump the version in `Cargo.toml` to the next major release version, that version can serve as the "latest" unpublished version for scenarios like local testing.
# Alternatives
There was a previous attempt to implement a compatibility test framework, which was later disabled; see [#3728](https://github.com/GreptimeTeam/greptimedb/issues/3728).
This RFC proposes the integration of a garbage collection (GC) mechanism within the Compaction process. This mechanism aims to manage and remove stale files that are no longer actively used by any system component, thereby reclaiming storage space.
## Motivation
With the introduction of features such as table repartitioning, a substantial number of Parquet files can become obsolete. Furthermore, failures during manifest updates may result in orphaned files that are never referenced by the system. Therefore, a periodic garbage collection mechanism is essential to reclaim storage space by systematically removing these unused files.
## Details
### Overview
The garbage collection process will be integrated directly into the Compaction process. Upon the completion of a Compaction for a given region, the GC worker will be automatically triggered. Its primary function will be to identify and subsequently delete obsolete files that have persisted beyond their designated retention period. This integration ensures that garbage collection is performed in close conjunction with data lifecycle management, effectively leveraging the compaction process's inherent knowledge of file states.
This design prioritizes correctness and safety by explicitly linking GC execution to a well-defined operational boundary: the successful completion of a compaction cycle.
### Terminology
- **Unused File**: Refers to a file present in the storage directory that has never been formally recorded in any manifest. A common scenario for this includes cases where a new SST file is successfully written to storage, but the subsequent update to the manifest fails, leaving the file unreferenced.
- **Obsolete File**: Denotes a file that was previously recorded in a manifest but has since been explicitly marked for removal. This typically occurs following operations such as data repartitioning or compaction.
### GC Worker Process
The GC worker operates as an integral part of the Compaction process. Once a Compaction for a specific region is completed, the GC worker is automatically triggered. Executing this process on a `datanode` is preferred to eliminate the overhead associated with having to set object storage configurations in the `metasrv`.
The detailed process is as follows (a sketch of the per-file decision follows the list):
1. **Invocation**: Upon the successful completion of a Compaction for a region, the GC worker is invoked.
2. **Manifest Reading**: The worker reads the region's primary manifest to obtain a comprehensive list of all files marked as obsolete. Concurrently, it reads any temporary manifests generated by long-running queries to identify files that are currently in active use, thereby preventing their premature deletion.
3. **Lingering Time Check (Obsolete Files)**: For each identified obsolete file, the GC worker evaluates its "lingering time", i.e., the time elapsed since the file was removed from the manifest.
4. **Deletion Marking (Obsolete Files)**: Files that have exceeded their configurable maximum lingering time and are not referenced by any active temporary manifests are marked for deletion.
5. **Lingering Time (Unused Files)**: Unused files (those never recorded in any manifest) are also subject to a configurable maximum lingering time before they are eligible for deletion.
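A minimal sketch of that per-file decision, assuming hypothetical types (`GcCandidate`, `GcConfig`) and field names; the actual implementation would live in the datanode's compaction path:
```rust
use std::time::{Duration, SystemTime};

/// Hypothetical per-file GC input. `removed_from_manifest_at` is `None` for
/// files never recorded in any manifest (the "unused" case).
struct GcCandidate {
    removed_from_manifest_at: Option<SystemTime>,
    created_at: SystemTime,
    referenced_by_tmp_manifest: bool,
}

/// Hypothetical configuration knobs for the two lingering times.
struct GcConfig {
    obsolete_lingering: Duration,
    unused_lingering: Duration,
}

/// Returns true if the file may be deleted in this GC cycle.
fn should_delete(file: &GcCandidate, cfg: &GcConfig, now: SystemTime) -> bool {
    // Files referenced by an active temporary manifest are always retained.
    if file.referenced_by_tmp_manifest {
        return false;
    }
    match file.removed_from_manifest_at {
        // Obsolete file: the clock starts when it left the manifest.
        Some(removed_at) => now
            .duration_since(removed_at)
            .map(|elapsed| elapsed > cfg.obsolete_lingering)
            .unwrap_or(false),
        // Unused file: the clock starts at file creation.
        None => now
            .duration_since(file.created_at)
            .map(|elapsed| elapsed > cfg.unused_lingering)
            .unwrap_or(false),
    }
}
```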
The following flowchart illustrates the GC worker's process:
```mermaid
flowchart TD
A[Compaction Completed] --> B[Trigger GC Worker]
B --> C[Scan Region Manifest]
C --> D[Identify File Types]
D --> E[Unused Files<br/>Never recorded in manifest]
D --> F[Obsolete Files<br/>Previously in manifest<br/>but marked for removal]
E --> G[Check Lingering Time]
F --> G
G --> H{File exceeds<br/>configured lingering time?}
H -->|No| I[Skip deletion]
H -->|Yes| J[Check Temporary Manifest]
J --> K{File in use by<br/>active queries?}
K -->|Yes| L[Retain file<br/>Wait for next GC cycle]
K -->|No| M[Safely delete file]
I --> N[End GC cycle]
L --> N
M --> O[Update Manifest]
O --> N
N --> P[Wait for next Compaction]
P --> A
style A fill:#e1f5fe
style B fill:#f3e5f5
style M fill:#e8f5e8
style L fill:#fff3e0
```
#### Handling Obsolete Files
An obsolete file is permanently deleted only if two conditions are met:
1. The time elapsed since its removal from the manifest (its obsolescence timestamp) exceeds a configurable threshold.
2. It is not currently referenced by any active temporary manifests.
#### Handling Unused Files
With the integration of the GC worker into the Compaction process, the risk of accidentally deleting newly created SST files that have not yet been recorded in the manifest is significantly mitigated. Consequently, the concept of "Unused Files" as a distinct category primarily susceptible to accidental deletion is largely resolved. Any files that are genuinely "unused" (i.e., never referenced by any manifest, including temporary ones) can be safely deleted after a configurable maximum lingering time.
For debugging and auditing purposes, a comprehensive list of recently deleted files can be maintained.
### Ensuring Read Consistency
To prevent the GC worker from inadvertently deleting files that are actively being utilized by long-running analytical queries, a robust protection mechanism is introduced. This mechanism relies on temporary manifests that are actively kept "alive" by the queries using them.
When a long-running query is detected (e.g., by a slow query recorder), it will write a temporary manifest to the region's manifest directory. This manifest lists all files required for the query. However, simply creating this file is not enough, as a query runner might crash, leaving the temporary manifest orphaned and preventing garbage collection indefinitely.
To address this, the following "heartbeat" mechanism is implemented:
1. **Periodic Updates**: The process executing the long-running query is responsible for periodically updating the modification timestamp of its temporary manifest file (i.e., "touching" the file). This serves as a heartbeat, signaling that the query is still active.
2. **GC Worker Verification**: When the GC worker runs, it scans for temporary manifests. For each one it finds, it checks the file's last modification time.
3. **Stale File Handling**: If a temporary manifest's last modification time is older than a configurable threshold, the GC worker considers it stale (left over from a crashed or terminated query). The GC worker will then delete this stale temporary manifest. Files that were protected only by this stale manifest are no longer shielded from garbage collection.
This approach ensures that only files for genuinely active queries are protected. The lifecycle of the temporary manifest is managed dynamically: it is created when a long query starts, kept alive through periodic updates, and is either deleted by the query upon normal completion or automatically cleaned up by the GC worker if the query terminates unexpectedly.
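A sketch of the staleness check the GC worker could apply to each temporary manifest, assuming a hypothetical view of the manifest file's modification time:
```rust
use std::time::{Duration, SystemTime};

/// Hypothetical view of a temporary manifest found in the manifest directory.
struct TmpManifest {
    last_modified: SystemTime,
}

/// A temporary manifest is stale when its heartbeat (the file's mtime) is
/// older than the configured timeout; stale manifests are deleted and the
/// files they listed lose their protection.
fn is_stale(m: &TmpManifest, heartbeat_timeout: Duration, now: SystemTime) -> bool {
    now.duration_since(m.last_modified)
        .map(|age| age > heartbeat_timeout)
        .unwrap_or(false)
}
```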
This mechanism may be too complex to implement at once. We can consider a two-phased approach:
1. **Phase 1 (Simple Time-Based Deletion)**: Initially, implement a simpler GC strategy that deletes obsolete files based solely on a configurable lingering time. This provides a baseline for space reclamation without the complexity of temporary manifests.
2. **Phase 2 (Consistency-Aware GC)**: Based on the practical effectiveness and observed issues from Phase 1, we can then decide whether to implement the full temporary manifest and heartbeat mechanism to handle long-running queries. This iterative approach allows for a quicker initial implementation while gathering real-world data to justify the need for a more complex solution.
## Drawbacks
- **Dependency on Compaction Frequency**: The integration of the GC worker with Compaction means that GC cycles are directly tied to the frequency of compactions. In environments with infrequent compaction operations, obsolete files may accumulate for extended periods before being reclaimed, potentially leading to increased storage consumption.
- **Race Condition with Long-Running Queries**: A potential race condition exists if a long-running query starts but has not yet written its temporary manifest when a compaction simultaneously begins and marks files used by that query as obsolete. This could lead to the premature deletion of files still required by the active query. To mitigate this, the threshold time for writing a temporary manifest should be significantly shorter than the lingering time configured for obsolete files, so that subsequent GC runs do not delete files that are by then referenced by a temporary manifest of a still-running query.
Additionally, a read replica's manifest version should not lag behind by more than the lingering time of obsolete files; otherwise it might reference files that the GC worker has already deleted.
- **Temporary Manifest Overhead**: The temporary manifest needs to be uploaded to object storage, which may introduce additional complexity and performance overhead. However, since long-running queries are typically infrequent, the impact is expected to be minimal.
One potential race condition with region migration is illustrated below:
```mermaid
sequenceDiagram
participant gc_worker as GC Worker (same datanode as region 1)
participant region1 as Region 1 (Leader → Follower)
participant region2 as Region 2 (Follower → Leader)
participant region_dir as Region Directory
gc_worker->>region1: Start GC, get region manifest
activate region1
region1-->>gc_worker: Region 1 manifest
deactivate region1
gc_worker->>region_dir: Scan region directory
Note over region1,region2: Region Migration Occurs
region1-->>region2: Downgrade to Follower
region2-->>region1: Becomes Leader
region2->>region_dir: Add new file
gc_worker->>region_dir: Continue scanning
gc_worker-->>region_dir: Discovers new file
Note over gc_worker: New file not in Region 1's manifest
gc_worker->>gc_worker: Mark file as orphan (incorrectly)
```
This could cause the GC worker to incorrectly mark the new file as an orphan and delete it, if the configured lingering time for orphan files (files not referenced by any manifest, used or unused) is not long enough.
A sufficient solution could be to take a lock that prevents the GC worker from running on a region while a region migration is in progress on that region, and vice versa.
The race condition between the GC worker and repartitioning also needs careful consideration. For now, acquiring a lock for both region migration and repartitioning during the GC worker's run could be a simple solution.
## Conclusion and Rationale
This section summarizes the key aspects and trade-offs of the proposed integrated GC worker, highlighting its advantages and potential challenges.
| Aspect | Current Proposal (Integrated GC) |
| :--- | :--- |
| **Implementation Complexity** | **Medium**. Requires careful integration with the compaction process and the slow query recorder for temporary manifest management. |
| **Reliability** | **High**. Integration with compaction and leveraging temporary manifests from long-running queries significantly mitigates the risk of incorrect deletion. Accurate management of lingering times for obsolete files and prevention of accidental deletion of newly created SSTs enhance data safety. |
| **Performance Overhead** | **Low to Medium**. The GC worker runs post-compaction, minimizing direct impact on write paths. Overhead from temporary manifest management by the slow query recorder is expected to be acceptable for long-running queries. |
| **Impact on Other Components** | **Moderate**. Requires modifications to the compaction process to trigger GC and the slow query recorder to manage temporary manifests. This introduces some coupling but enhances overall data safety. |
| **Deletion Strategy** | **State- and Time-Based**. Obsolete files are deleted based on a configurable lingering time, which is paused if the file is referenced by a temporary manifest. Unused files (never in a manifest) are also subject to a lingering time. |
## Unresolved Questions and Future Work
This section outlines key areas requiring further discussion and defines potential avenues for future development.
* **Slow Query Recorder Implementation**: Detailed specifications are needed for modifying the slow query recorder and for its precise interaction with temporary manifests.
* **Configurable Lingering Times**: Establish, and make configurable, the specific lingering times for both obsolete and unused files, balancing storage reclamation against data availability.
## Alternatives
### 1. Standalone GC Service
Instead of integrating the GC worker directly into the Compaction process, a standalone GC service could be implemented. This service would operate independently, periodically scanning the storage for obsolete and unused files based on manifest information and predefined retention policies.
**Pros:**
* **Decoupling**: Separates GC logic from compaction, allowing independent scaling and deployment.
* **Flexibility**: Can be configured to run at different frequencies and with different strategies than compaction.
**Cons:**
* **Increased Complexity**: Requires a separate service to manage, monitor, and coordinate with other components.
* **Potential for Redundancy**: May duplicate some file scanning logic already present in compaction.
* **Consistency Challenges**: Ensuring read consistency would require more complex coordination mechanisms between the standalone GC service and active queries, potentially involving a distributed lock manager or a more sophisticated temporary manifest system.
This alternative could be implemented in the future if the integrated GC worker proves insufficient or if there is a need for more advanced GC strategies.
### 2. Manifest-Driven Deletion (No Lingering Time)
This alternative would involve immediate deletion of files once they are removed from the manifest, without a lingering time.
**Pros:**
* **Simplicity**: Simplifies the GC logic by removing the need for lingering time management.
* **Immediate Space Reclamation**: Storage space is reclaimed as soon as files are marked for deletion.
**Cons:**
* **Increased Risk of Data Loss**: Higher risk of deleting files still in use by long-running queries or other processes if not perfectly synchronized.
* **Complex Read Consistency**: Requires extremely robust and immediate mechanisms to ensure that no active queries are referencing files marked for deletion, potentially leading to performance bottlenecks or complex error handling.
* **Debugging Challenges**: Difficult to debug issues related to premature file deletion due to the immediate nature of the operation.
This RFC proposes an asynchronous index build mechanism in the database, with a configuration option to choose between synchronous and asynchronous modes, aiming to improve flexibility and adapt to different workload requirements.
# Motivation
Currently, index creation is performed synchronously, which may lead to prolonged write suspension and impact business continuity. As data volume grows, the time required for index building increases significantly. An asynchronous solution is urgently needed to enhance user experience and system throughput.
# Details
## Overview
The following table highlights the difference between async and sync index approach:
| Approach | Trigger | Data Source | Additional Index Metadata Installation | Fine-grained `FileMeta` Index |
| :--- | :--- | :--- | :--- | :--- |
| Sync Index | On `write_sst` | Memory (on flush) / Disk (on compact) | Not required (already installed synchronously) | Not required |
| Async Index | 4 trigger types | Disk | Required | Required |
The index build mode (synchronous or asynchronous) can be selected via configuration file.
### Four Trigger Types
This RFC introduces four `IndexBuildType`s to trigger index building:
- **Manual Rebuild**: Triggered by the user via `ADMIN build_index("table_name")`, for scenarios like recovering from failed builds or migrating data. SST files whose `ColumnIndexMetadata` (see below) is already consistent with the `RegionMetadata` will be skipped.
- **Schema Change**: Automatically triggered when the schema of an indexed column is altered.
- **Flush**: Automatically builds indexes for new SST files created by a flush.
- **Compact**: Automatically builds indexes for new SST files created by a compaction.
### Additional Index Metadata Installation
Previously, index information in the in-memory `FileMeta` was updated synchronously. The async approach requires an explicit installation step.
A race condition can occur when compaction and index building run concurrently, leading to:
1. Building an index for a file that is about to be deleted by compaction.
2. Creating an unnecessary index file and an incorrect manifest record.
3. On restart, replaying the manifest could load metadata for a non-existent file.
To prevent this, the system checks if a file's `FileMeta` is in a `compacting` state before updating the manifest. If it is, the installation is aborted.
### Fine-grained `FileMeta` Index
The original `FileMeta` only stored file-level index information. However, manual rebuilds require column-level details to identify files inconsistent with the current DDL. Therefore, the `indexes` field in `FileMeta` is updated as follows:
```rust
struct FileMeta {
    ...
    // From file-level:
    // available_indexes: SmallVec<[IndexType; 4]>
    // To column-level:
    indexes: Vec<ColumnIndexMetadata>,
    ...
}

pub struct ColumnIndexMetadata {
    pub column_id: ColumnId,
    pub created_indexes: IndexTypes,
}
```
## Process
The index building process is similar to a flush and is illustrated below:
```mermaid
sequenceDiagram
Region0->>Region0: Triggered by one of 4 conditions, targets specific files
loop For each target file
Region0->>IndexBuildScheduler: Submits an index build task
end
IndexBuildScheduler->>IndexBuildTask: Executes the task
IndexBuildTask->>Storage Interfaces: Reads SST data from disk
IndexBuildTask->>IndexBuildTask: Builds the index file
Region0->>Storage Interfaces: Updates manifest and Version
```
### Task Triggering and Scheduling
The process starts with one of the four `IndexBuildType` triggers. In `handle_rebuild_index`, the `RegionWorkerLoop` identifies target SSTs from the request or the current region version. It then creates an `IndexBuildTask` for each file and submits it to the `index_build_scheduler`.
Similar to Flush and Compact operations, index build tasks are ultimately dispatched to the LocalScheduler. Resource usage can be adjusted via configuration files. Since asynchronous index tasks are both memory-intensive and IO-intensive but have lower priority, it is recommended to allocate fewer resources to them compared to compaction and flush tasks—for example, limiting them to 1/8 of the CPU cores.
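As a rough illustration of that sizing recommendation (the function is hypothetical, not an existing config option):
```rust
/// Illustrative sizing rule: cap asynchronous index-build workers at 1/8 of
/// the CPU cores, keeping at least one worker.
fn default_index_build_workers(num_cores: usize) -> usize {
    (num_cores / 8).max(1)
}
```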
### Index Building and Notification
The scheduled `IndexBuildTask` executes its `index_build` method. It uses an `indexer_builder` to create an `Indexer` that reads SST data and builds the index. If a new index file is created (`IndexOutput.file_size > 0`), the task sends an `IndexBuildFinished` notification back to the `RegionWorkerLoop`.
### Index Metadata Installation
Upon receiving the `IndexBuildFinished` notification in `handle_index_build_finished`, the `RegionWorkerLoop` verifies that the file still exists in the current `version` and is not being compacted. If the check passes, it calls `manifest_ctx.update_manifest` to apply a `RegionEdit` with the new index information, completing the installation.
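A self-contained sketch of that verification, with hypothetical types standing in for the real region version and `FileMeta` (only the shape of the check is taken from this RFC):
```rust
use std::collections::HashMap;

type FileId = u64;

/// Hypothetical stand-ins for the region version and per-file state.
struct VersionView {
    ssts: HashMap<FileId, SstEntry>,
}

struct SstEntry {
    compacting: bool,
}

/// Returns true if the index metadata for `file_id` may be installed via a
/// `RegionEdit`; otherwise the notification is dropped.
fn may_install(version: &VersionView, file_id: FileId) -> bool {
    match version.ssts.get(&file_id) {
        // The file was removed from the version (e.g. compacted away).
        None => false,
        // The file is mid-compaction: abort to avoid a dangling record.
        Some(entry) if entry.compacting => false,
        Some(_) => true,
    }
}
```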
# Drawbacks
Asynchronous index building may consume extra system resources, potentially affecting overall performance during peak periods.
There may be a delay before the new index becomes available for queries, which could impact certain use cases.
# Unresolved Questions and Future Work
**Resource Management and Throttling**: The resource consumption (CPU, I/O) of background index building can be managed and limited to some extent by configuring a dedicated background thread pool. However, this approach cannot fully eliminate resource contention, especially under heavy workloads or when I/O is highly competitive. Additional throttling mechanisms or dynamic prioritization may still be necessary to avoid impacting foreground operations.
# Alternatives
Instead of being triggered by events like Flush or Compact, index building could be performed in batches during scheduled maintenance windows. This offers predictable resource usage but delays index availability.
This RFC proposes a redesign of the flow architecture where flownode becomes a lightweight in-memory state management node with an embedded frontend for direct computation. This approach optimizes resource utilization and improves scalability by eliminating network hops while maintaining clear separation between coordination and computation tasks.
## Motivation
The current flow architecture has several limitations:
1. **Resource Inefficiency**: Flownodes perform both state management and computation, leading to resource duplication and inefficient utilization.
2. **Scalability Constraints**: Computation resources are tied to flownode instances, limiting horizontal scaling capabilities.
3. **State Management Complexity**: Mixing computation with state management makes the system harder to maintain and debug.
4. **Network Overhead**: Additional network hops between flownode and separate frontend nodes add latency.
The laminar Flow architecture addresses these issues by:
- Consolidating computation within flownode through embedded frontend
- Eliminating network overhead by removing separate frontend node communication
- Simplifying state management by focusing flownode on its core responsibility
- Improving system scalability and maintainability
## Details
### Architecture Overview
The laminar Flow architecture transforms flownode into a lightweight coordinator that maintains flow state with an embedded frontend for computation. The key components involved are:
1. **Flownode**: Maintains in-memory state, coordinates computation, and includes an embedded frontend for query execution
2. **Embedded Frontend**: Executes **incremental** computations within the flownode
3. **Datanode**: Stores final results and source data
The datanode processes the sequence parameters carried in the query request (`memtable_last_seq` and `sst_last_seq`, described below) and returns only the data within the specified sequence ranges, ensuring efficient incremental processing.
#### Sequence Invalidation and Refill Mechanism
A critical challenge occurs when data referenced by `memtable_last_seq` gets flushed from memory to disk. Since SST files only maintain a single maximum sequence number for the entire file (rather than per-record sequence tracking), precise incremental queries become impossible for the affected time ranges.
**Detection of Invalidation:**
```rust
// When memtable_last_seq data has been flushed to SST
if memtable_last_seq_flushed_to_disk {
// Incremental query is no longer feasible
// Need to trigger refill for affected time ranges
}
```
**Refill Process:**
1. **Identify Affected Time Range**: Query the time range corresponding to the flushed `memtable_last_seq` data
2. **Full Recomputation**: Execute a complete aggregation query for the affected time windows
3. **State Replacement**: Replace the existing flow state for these time ranges with newly computed values
4. **Sequence Update**: Update `memtable_last_seq` to the current latest sequence, while `sst_last_seq` continues normal incremental updates
```sql
-- Refill query when memtable data has been flushed
SELECT
    __aggr_state(aggregation_functions) AS state,
    time_window,
    group_keys
FROM source_table
WHERE
    timestamp >= :affected_time_start
    AND timestamp < :affected_time_end
-- Full scan required since sequence precision is lost in SST
GROUP BY time_window, group_keys;
```
#### Datanode Implementation Requirements
Datanode must implement enhanced query processing capabilities to support sequence-based incremental reads:
**Input Processing:**
- Accept `memtable_last_seq` and `sst_last_seq` parameters in query requests
- Filter data based on sequence ranges across both memtable and SST storage layers
**Output Enhancement:**
```rust
struct OutputMeta {
    pub plan: Option<Arc<dyn ExecutionPlan>>,
    pub cost: OutputCost,
    // New field: sequence tracking for each region involved in the query.
    pub sequence_info: HashMap<RegionId, SequenceInfo>,
}

struct SequenceInfo {
    // Sequence tracking for the next iteration.
    max_memtable_seq: SequenceNumber, // Highest sequence from memtable in this result
    max_sst_seq: SequenceNumber,      // Highest sequence from SST in this result
}
```
**Sequence Tracking Logic:**
The datanode already implements `max_sst_seq` tracking for leader range reads; similar logic can be reused for `max_memtable_seq`.
#### Sequence Update Strategy
**Normal Incremental Updates:**
- Update both `memtable_last_seq` and `sst_last_seq` after successful query execution
- Use returned `max_memtable_seq` and `max_sst_seq` values for next iteration
**Refill Scenario:**
- Reset `memtable_last_seq` to current maximum after refill completion
- Continue normal `sst_last_seq` updates based on successful query responses
- Maintain separate tracking to detect future flush events
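A sketch of how the flownode could maintain these cursors, assuming a hypothetical `FlowSequenceState`; the `SequenceInfo` shape mirrors the struct shown earlier:
```rust
use std::collections::HashMap;

type RegionId = u64;
type SequenceNumber = u64;

/// Mirrors the `SequenceInfo` returned in `OutputMeta` (see above).
struct SequenceInfo {
    max_memtable_seq: SequenceNumber,
    max_sst_seq: SequenceNumber,
}

/// Hypothetical per-region cursors kept by the flownode.
#[derive(Default)]
struct FlowSequenceState {
    memtable_last_seq: HashMap<RegionId, SequenceNumber>,
    sst_last_seq: HashMap<RegionId, SequenceNumber>,
}

impl FlowSequenceState {
    /// Normal incremental update after a successful query.
    fn advance(&mut self, infos: &HashMap<RegionId, SequenceInfo>) {
        for (region, info) in infos {
            self.memtable_last_seq.insert(*region, info.max_memtable_seq);
            self.sst_last_seq.insert(*region, info.max_sst_seq);
        }
    }

    /// Refill scenario: reset the memtable cursor to the current maximum;
    /// the SST cursor keeps advancing through normal updates.
    fn reset_after_refill(&mut self, region: RegionId, current_max: SequenceNumber) {
        self.memtable_last_seq.insert(region, current_max);
    }
}
```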
#### Performance Considerations
**Sequence Range Optimization:**
- Minimize sequence range spans to reduce scan overhead
- Batch multiple small incremental updates when beneficial
- Balance between query frequency and processing efficiency
**Memory Management:**
- Monitor memtable flush frequency to predict refill requirements
- Implement adaptive query scheduling based on flush patterns
- Optimize state storage to handle frequent updates efficiently
This sequence-based read implementation ensures reliable incremental processing while gracefully handling the complexities of the storage architecture, maintaining both correctness and performance in the face of background compaction and flush operations.
## Implementation Plan
### Phase 1: Core Infrastructure
1. **State Management**: Implement the in-memory state map in flownode
2. **Query Interface**: Integrate the `__aggr_state` query interface in the embedded frontend (already done in previous query pushdown optimizer work)
3. **Basic Coordination**: Implement query dispatch and result collection
4. **Sequence Tracking**: Implement sequence-based incremental processing (can reuse an interface similar to the one leader range reads use)
After phase 1, the system should support basic flow operations with incremental updates.
### Phase 2: Optimization Features
1. **Refill Logic**: Develop state recovery mechanisms
2. **Datanode Optimization**: Optimize result writing from flownode
3. **Metasrv Coordination**: Enhance metadata management and coordination
## Conclusion
The laminar Flow architecture represents a significant improvement over the current flow system by separating state management from computation execution. This design enables better resource utilization, improved scalability, and simplified maintenance while maintaining the core functionality of continuous aggregation.
The key benefits include:
1. **Improved Scalability**: Computation can scale independently of state management
While the architecture introduces some complexity in terms of distributed coordination and error handling, the benefits significantly outweigh the drawbacks, making it a compelling evolution of the flow system.
@@ -83,7 +83,7 @@ If you use the [Helm Chart](https://github.com/GreptimeTeam/helm-charts) to depl
- `monitoring.enabled=true`: Deploys a standalone GreptimeDB instance dedicated to monitoring the cluster;
- `grafana.enabled=true`: Deploys Grafana and automatically imports the monitoring dashboard;
The standalone GreptimeDB instance will collect metrics from your cluster, and the dashboard will be available in the Grafana UI. For detailed deployment instructions, please refer to our [Kubernetes deployment guide](https://docs.greptime.com/user-guide/deployments-administration-administration/deploy-on-kubernetes/getting-started).
The standalone GreptimeDB instance will collect metrics from your cluster, and the dashboard will be available in the Grafana UI. For detailed deployment instructions, please refer to our [Kubernetes deployment guide](https://docs.greptime.com/user-guide/deployments-administration/deploy-on-kubernetes/overview).
### Self-host Prometheus and import dashboards manually
| Title | Query | Type | Description | Datasource | Unit | Legend Format |
| --- | --- | --- | --- | --- | --- | --- |
| Datanode Memory per Instance | `sum(process_resident_memory_bytes{instance=~"$datanode"}) by (instance, pod)` | `timeseries` | Current memory usage by instance | `prometheus` | `decbytes` | `[{{instance}}]-[{{ pod }}]` |
| Datanode CPU Usage per Instance | `sum(rate(process_cpu_seconds_total{instance=~"$datanode"}[$__rate_interval]) * 1000) by (instance, pod)` | `timeseries` | Current cpu usage by instance | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]` |
| Frontend Memory per Instance | `sum(process_resident_memory_bytes{instance=~"$frontend"}) by (instance, pod)` | `timeseries` | Current memory usage by instance | `prometheus` | `decbytes` | `[{{ instance }}]-[{{ pod }}]` |
| Frontend CPU Usage per Instance | `sum(rate(process_cpu_seconds_total{instance=~"$frontend"}[$__rate_interval]) * 1000) by (instance, pod)` | `timeseries` | Current cpu usage by instance | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]-cpu` |
| Metasrv Memory per Instance | `sum(process_resident_memory_bytes{instance=~"$metasrv"}) by (instance, pod)` | `timeseries` | Current memory usage by instance | `prometheus` | `decbytes` | `[{{ instance }}]-[{{ pod }}]-resident` |
| Metasrv CPU Usage per Instance | `sum(rate(process_cpu_seconds_total{instance=~"$metasrv"}[$__rate_interval]) * 1000) by (instance, pod)` | `timeseries` | Current cpu usage by instance | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]` |
| Flownode Memory per Instance | `sum(process_resident_memory_bytes{instance=~"$flownode"}) by (instance, pod)` | `timeseries` | Current memory usage by instance | `prometheus` | `decbytes` | `[{{ instance }}]-[{{ pod }}]` |
| Flownode CPU Usage per Instance | `sum(rate(process_cpu_seconds_total{instance=~"$flownode"}[$__rate_interval]) * 1000) by (instance, pod)` | `timeseries` | Current cpu usage by instance | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]` |
| Datanode Memory per Instance | `sum(process_resident_memory_bytes{instance=~"$datanode"}) by (instance, pod)`<br/>`max(greptime_memory_limit_in_bytes{instance=~"$datanode"})` | `timeseries` | Current memory usage by instance | `prometheus` | `bytes` | `[{{instance}}]-[{{ pod }}]` |
| Datanode CPU Usage per Instance | `sum(rate(process_cpu_seconds_total{instance=~"$datanode"}[$__rate_interval]) * 1000) by (instance, pod)`<br/>`max(greptime_cpu_limit_in_millicores{instance=~"$datanode"})` | `timeseries` | Current cpu usage by instance | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]` |
| Frontend Memory per Instance | `sum(process_resident_memory_bytes{instance=~"$frontend"}) by (instance, pod)`<br/>`max(greptime_memory_limit_in_bytes{instance=~"$frontend"})` | `timeseries` | Current memory usage by instance | `prometheus` | `bytes` | `[{{ instance }}]-[{{ pod }}]` |
| Frontend CPU Usage per Instance | `sum(rate(process_cpu_seconds_total{instance=~"$frontend"}[$__rate_interval]) * 1000) by (instance, pod)`<br/>`max(greptime_cpu_limit_in_millicores{instance=~"$frontend"})` | `timeseries` | Current cpu usage by instance | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]-cpu` |
| Metasrv Memory per Instance | `sum(process_resident_memory_bytes{instance=~"$metasrv"}) by (instance, pod)`<br/>`max(greptime_memory_limit_in_bytes{instance=~"$metasrv"})` | `timeseries` | Current memory usage by instance | `prometheus` | `bytes` | `[{{ instance }}]-[{{ pod }}]-resident` |
| Metasrv CPU Usage per Instance | `sum(rate(process_cpu_seconds_total{instance=~"$metasrv"}[$__rate_interval]) * 1000) by (instance, pod)`<br/>`max(greptime_cpu_limit_in_millicores{instance=~"$metasrv"})` | `timeseries` | Current cpu usage by instance | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]` |
| Flownode Memory per Instance | `sum(process_resident_memory_bytes{instance=~"$flownode"}) by (instance, pod)`<br/>`max(greptime_memory_limit_in_bytes{instance=~"$flownode"})` | `timeseries` | Current memory usage by instance | `prometheus` | `bytes` | `[{{ instance }}]-[{{ pod }}]` |
| Flownode CPU Usage per Instance | `sum(rate(process_cpu_seconds_total{instance=~"$flownode"}[$__rate_interval]) * 1000) by (instance, pod)`<br/>`max(greptime_cpu_limit_in_millicores{instance=~"$flownode"})` | `timeseries` | Current cpu usage by instance | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]` |
# Frontend Requests
| Title | Query | Type | Description | Datasource | Unit | Legend Format |
| --- | --- | --- | --- | --- | --- | --- |
@@ -72,20 +72,28 @@
| Region Worker Handle Bulk Insert Requests | `histogram_quantile(0.95, sum by(le,instance, stage, pod) (rate(greptime_region_worker_handle_write_bucket[$__rate_interval])))`<br/>`sum by(instance, stage, pod) (rate(greptime_region_worker_handle_write_sum[$__rate_interval]))/sum by(instance, stage, pod) (rate(greptime_region_worker_handle_write_count[$__rate_interval]))` | `timeseries` | Per-stage elapsed time for region worker to handle bulk insert region requests. | `prometheus` | `s` | `[{{instance}}]-[{{pod}}]-[{{stage}}]-P95` |
| Active Series and Field Builders Count | `sum by(instance, pod) (greptime_mito_memtable_active_series_count)`<br/>`sum by(instance, pod) (greptime_mito_memtable_field_builder_count)` | `timeseries` | Number of active series and field builders in the memtable | `prometheus` | `none` | `[{{instance}}]-[{{pod}}]-series` |
| Region Worker Convert Requests | `histogram_quantile(0.95, sum by(le, instance, stage, pod) (rate(greptime_datanode_convert_region_request_bucket[$__rate_interval])))`<br/>`sum by(le,instance, stage, pod) (rate(greptime_datanode_convert_region_request_sum[$__rate_interval]))/sum by(le,instance, stage, pod) (rate(greptime_datanode_convert_region_request_count[$__rate_interval]))` | `timeseries` | Per-stage elapsed time for region worker to decode requests. | `prometheus` | `s` | `[{{instance}}]-[{{pod}}]-[{{stage}}]-P95` |
| Cache Miss | `sum by (instance,pod, type) (rate(greptime_mito_cache_miss{instance=~"$datanode"}[$__rate_interval]))` | `timeseries` | The local cache miss of the datanode. | `prometheus` | -- | `[{{instance}}]-[{{pod}}]-[{{type}}]` |
# OpenDAL
| Title | Query | Type | Description | Datasource | Unit | Legend Format |
| --- | --- | --- | --- | --- | --- | --- |
| Write P99 per Instance | `histogram_quantile(0.99, sum by(instance, pod, le, scheme, operation) (rate(opendal_operation_duration_seconds_bucket{instance=~"$datanode", operation =~ "Writer::write\|Writer::close\|write"}[$__rate_interval])))` | `timeseries` | Write P99 per Instance. | `prometheus` | `s` | `[{{instance}}]-[{{pod}}]-[{{scheme}}]-[{{operation}}]` |
| List QPS per Instance | `sum by(instance, pod, scheme) (rate(opendal_operation_duration_seconds_count{instance=~"$datanode", operation="list"}[$__rate_interval]))` | `timeseries` | List QPS per Instance. | `prometheus` | `ops` | `[{{instance}}]-[{{pod}}]-[{{scheme}}]` |
| List P99 per Instance | `histogram_quantile(0.99, sum by(instance, pod, le, scheme) (rate(opendal_operation_duration_seconds_bucket{instance=~"$datanode", operation="list"}[$__rate_interval])))` | `timeseries` | List P99 per Instance. | `prometheus` | `s` | `[{{instance}}]-[{{pod}}]-[{{scheme}}]` |
| Other Requests per Instance | `sum by(instance, pod, scheme, operation) (rate(opendal_operation_duration_seconds_count{instance=~"$datanode",operation!~"read\|write\|list\|stat"}[$__rate_interval]))` | `timeseries` | Other Requests per Instance. | `prometheus` | `ops` | `[{{instance}}]-[{{pod}}]-[{{scheme}}]-[{{operation}}]` |
| Other Request P99 per Instance | `histogram_quantile(0.99, sum by(instance, pod, le, scheme, operation) (rate(opendal_operation_duration_seconds_bucket{instance=~"$datanode", operation!~"read\|write\|list"}[$__rate_interval])))` | `timeseries` | Other Request P99 per Instance. | `prometheus` | `s` | `[{{instance}}]-[{{pod}}]-[{{scheme}}]-[{{operation}}]` |
| Other Request P99 per Instance | `histogram_quantile(0.99, sum by(instance, pod, le, scheme, operation) (rate(opendal_operation_duration_seconds_bucket{instance=~"$datanode", operation!~"read\|write\|list\|Writer::write\|Writer::close\|Reader::read"}[$__rate_interval])))` | `timeseries` | Other Request P99 per Instance. | `prometheus` | `s` | `[{{instance}}]-[{{pod}}]-[{{scheme}}]-[{{operation}}]` |
| Opendal traffic | `sum by(instance, pod, scheme, operation) (rate(opendal_operation_bytes_sum{instance=~"$datanode"}[$__rate_interval]))` | `timeseries` | Total traffic as in bytes by instance and operation | `prometheus` | `decbytes` | `[{{instance}}]-[{{pod}}]-[{{scheme}}]-[{{operation}}]` |
| Title | Query | Type | Description | Datasource | Unit | Legend Format |
| --- | --- | --- | --- | --- | --- | --- |
@@ -102,6 +110,8 @@
| Meta KV Ops Latency | `histogram_quantile(0.99, sum by(pod, le, op, target) (greptime_meta_kv_request_elapsed_bucket))` | `timeseries` | P99 latency of metasrv KV operations, by operation and target. | `prometheus` | `s` | `{{pod}}-{{op}} p99` |
| Rate of meta KV Ops | `rate(greptime_meta_kv_request_elapsed_count[$__rate_interval])` | `timeseries` | Rate of metasrv KV operations. | `prometheus` | `none` | `{{pod}}-{{op}} p99` |
| DDL Latency | `histogram_quantile(0.9, sum by(le, pod, step) (greptime_meta_procedure_create_tables_bucket))`<br/>`histogram_quantile(0.9, sum by(le, pod, step) (greptime_meta_procedure_create_table))`<br/>`histogram_quantile(0.9, sum by(le, pod, step) (greptime_meta_procedure_create_view))`<br/>`histogram_quantile(0.9, sum by(le, pod, step) (greptime_meta_procedure_create_flow))`<br/>`histogram_quantile(0.9, sum by(le, pod, step) (greptime_meta_procedure_drop_table))`<br/>`histogram_quantile(0.9, sum by(le, pod, step) (greptime_meta_procedure_alter_table))` | `timeseries` | P90 latency of metasrv DDL procedures (create/drop/alter), by step. | `prometheus` | `s` | `CreateLogicalTables-{{step}} p90` |
- expr: histogram_quantile(0.99, sum by(instance, pod, le, scheme) (rate(opendal_operation_duration_seconds_bucket{instance=~"$datanode",operation="read"}[$__rate_interval])))
- expr: histogram_quantile(0.99, sum by(instance, pod, le, scheme, operation) (rate(opendal_operation_duration_seconds_bucket{instance=~"$datanode",operation=~"read|Reader::read"}[$__rate_interval])))
- expr: histogram_quantile(0.99, sum by(instance, pod, le, scheme) (rate(opendal_operation_duration_seconds_bucket{instance=~"$datanode", operation="write"}[$__rate_interval])))
- expr: histogram_quantile(0.99, sum by(instance, pod, le, scheme, operation) (rate(opendal_operation_duration_seconds_bucket{instance=~"$datanode", operation =~ "Writer::write|Writer::close|write"}[$__rate_interval])))
- expr: histogram_quantile(0.99, sum by(instance, pod, le, scheme, operation) (rate(opendal_operation_duration_seconds_bucket{instance=~"$datanode", operation!~"read|write|list"}[$__rate_interval])))
- expr: histogram_quantile(0.99, sum by(instance, pod, le, scheme, operation) (rate(opendal_operation_duration_seconds_bucket{instance=~"$datanode", operation!~"read|write|list|Writer::write|Writer::close|Reader::read"}[$__rate_interval])))
| Title | Query | Type | Description | Datasource | Unit | Legend Format |
| --- | --- | --- | --- | --- | --- | --- |
| Datanode Memory per Instance | `sum(process_resident_memory_bytes{}) by (instance, pod)` | `timeseries` | Current memory usage by instance | `prometheus` | `decbytes` | `[{{instance}}]-[{{ pod }}]` |
| Datanode CPU Usage per Instance | `sum(rate(process_cpu_seconds_total{}[$__rate_interval]) * 1000) by (instance, pod)` | `timeseries` | Current cpu usage by instance | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]` |
| Frontend Memory per Instance | `sum(process_resident_memory_bytes{}) by (instance, pod)` | `timeseries` | Current memory usage by instance | `prometheus` | `decbytes` | `[{{ instance }}]-[{{ pod }}]` |
| Frontend CPU Usage per Instance | `sum(rate(process_cpu_seconds_total{}[$__rate_interval]) * 1000) by (instance, pod)` | `timeseries` | Current cpu usage by instance | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]-cpu` |
| Metasrv Memory per Instance | `sum(process_resident_memory_bytes{}) by (instance, pod)` | `timeseries` | Current memory usage by instance | `prometheus` | `decbytes` | `[{{ instance }}]-[{{ pod }}]-resident` |
| Metasrv CPU Usage per Instance | `sum(rate(process_cpu_seconds_total{}[$__rate_interval]) * 1000) by (instance, pod)` | `timeseries` | Current cpu usage by instance | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]` |
| Flownode Memory per Instance | `sum(process_resident_memory_bytes{}) by (instance, pod)` | `timeseries` | Current memory usage by instance | `prometheus` | `decbytes` | `[{{ instance }}]-[{{ pod }}]` |
| Flownode CPU Usage per Instance | `sum(rate(process_cpu_seconds_total{}[$__rate_interval]) * 1000) by (instance, pod)` | `timeseries` | Current cpu usage by instance | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]` |
| Datanode Memory per Instance | `sum(process_resident_memory_bytes{}) by (instance, pod)`<br/>`max(greptime_memory_limit_in_bytes{})` | `timeseries` | Current memory usage by instance | `prometheus` | `bytes` | `[{{instance}}]-[{{ pod }}]` |
| Datanode CPU Usage per Instance | `sum(rate(process_cpu_seconds_total{}[$__rate_interval]) * 1000) by (instance, pod)`<br/>`max(greptime_cpu_limit_in_millicores{})` | `timeseries` | Current cpu usage by instance | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]` |
| Frontend Memory per Instance | `sum(process_resident_memory_bytes{}) by (instance, pod)`<br/>`max(greptime_memory_limit_in_bytes{})` | `timeseries` | Current memory usage by instance | `prometheus` | `bytes` | `[{{ instance }}]-[{{ pod }}]` |
| Frontend CPU Usage per Instance | `sum(rate(process_cpu_seconds_total{}[$__rate_interval]) * 1000) by (instance, pod)`<br/>`max(greptime_cpu_limit_in_millicores{})` | `timeseries` | Current cpu usage by instance | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]-cpu` |
| Metasrv Memory per Instance | `sum(process_resident_memory_bytes{}) by (instance, pod)`<br/>`max(greptime_memory_limit_in_bytes{})` | `timeseries` | Current memory usage by instance | `prometheus` | `bytes` | `[{{ instance }}]-[{{ pod }}]-resident` |
| Metasrv CPU Usage per Instance | `sum(rate(process_cpu_seconds_total{}[$__rate_interval]) * 1000) by (instance, pod)`<br/>`max(greptime_cpu_limit_in_millicores{})` | `timeseries` | Current cpu usage by instance | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]` |
| Flownode Memory per Instance | `sum(process_resident_memory_bytes{}) by (instance, pod)`<br/>`max(greptime_memory_limit_in_bytes{})` | `timeseries` | Current memory usage by instance | `prometheus` | `bytes` | `[{{ instance }}]-[{{ pod }}]` |
| Flownode CPU Usage per Instance | `sum(rate(process_cpu_seconds_total{}[$__rate_interval]) * 1000) by (instance, pod)`<br/>`max(greptime_cpu_limit_in_millicores{})` | `timeseries` | Current cpu usage by instance | `prometheus` | `none` | `[{{ instance }}]-[{{ pod }}]` |
# Frontend Requests
| Title | Query | Type | Description | Datasource | Unit | Legend Format |
| --- | --- | --- | --- | --- | --- | --- |
@@ -72,20 +72,28 @@
| Region Worker Handle Bulk Insert Requests | `histogram_quantile(0.95, sum by(le,instance, stage, pod) (rate(greptime_region_worker_handle_write_bucket[$__rate_interval])))`<br/>`sum by(instance, stage, pod) (rate(greptime_region_worker_handle_write_sum[$__rate_interval]))/sum by(instance, stage, pod) (rate(greptime_region_worker_handle_write_count[$__rate_interval]))` | `timeseries` | Per-stage elapsed time for region worker to handle bulk insert region requests. | `prometheus` | `s` | `[{{instance}}]-[{{pod}}]-[{{stage}}]-P95` |
| Active Series and Field Builders Count | `sum by(instance, pod) (greptime_mito_memtable_active_series_count)`<br/>`sum by(instance, pod) (greptime_mito_memtable_field_builder_count)` | `timeseries` | Number of active series and field builders in the memtable | `prometheus` | `none` | `[{{instance}}]-[{{pod}}]-series` |
| Region Worker Convert Requests | `histogram_quantile(0.95, sum by(le, instance, stage, pod) (rate(greptime_datanode_convert_region_request_bucket[$__rate_interval])))`<br/>`sum by(le,instance, stage, pod) (rate(greptime_datanode_convert_region_request_sum[$__rate_interval]))/sum by(le,instance, stage, pod) (rate(greptime_datanode_convert_region_request_count[$__rate_interval]))` | `timeseries` | Per-stage elapsed time for region worker to decode requests. | `prometheus` | `s` | `[{{instance}}]-[{{pod}}]-[{{stage}}]-P95` |
| Cache Miss | `sum by (instance,pod, type) (rate(greptime_mito_cache_miss{}[$__rate_interval]))` | `timeseries` | The local cache miss of the datanode. | `prometheus` | -- | `[{{instance}}]-[{{pod}}]-[{{type}}]` |
# OpenDAL
| Title | Query | Type | Description | Datasource | Unit | Legend Format |
| --- | --- | --- | --- | --- | --- | --- |
| Write P99 per Instance | `histogram_quantile(0.99, sum by(instance, pod, le, scheme, operation) (rate(opendal_operation_duration_seconds_bucket{ operation =~ "Writer::write\|Writer::close\|write"}[$__rate_interval])))` | `timeseries` | Write P99 per Instance. | `prometheus` | `s` | `[{{instance}}]-[{{pod}}]-[{{scheme}}]-[{{operation}}]` |
| List QPS per Instance | `sum by(instance, pod, scheme) (rate(opendal_operation_duration_seconds_count{ operation="list"}[$__rate_interval]))` | `timeseries` | List QPS per Instance. | `prometheus` | `ops` | `[{{instance}}]-[{{pod}}]-[{{scheme}}]` |
| List P99 per Instance | `histogram_quantile(0.99, sum by(instance, pod, le, scheme) (rate(opendal_operation_duration_seconds_bucket{ operation="list"}[$__rate_interval])))` | `timeseries` | List P99 per Instance. | `prometheus` | `s` | `[{{instance}}]-[{{pod}}]-[{{scheme}}]` |
| Other Requests per Instance | `sum by(instance, pod, scheme, operation) (rate(opendal_operation_duration_seconds_count{operation!~"read\|write\|list\|stat"}[$__rate_interval]))` | `timeseries` | Other Requests per Instance. | `prometheus` | `ops` | `[{{instance}}]-[{{pod}}]-[{{scheme}}]-[{{operation}}]` |
| Other Request P99 per Instance | `histogram_quantile(0.99, sum by(instance, pod, le, scheme, operation) (rate(opendal_operation_duration_seconds_bucket{ operation!~"read\|write\|list"}[$__rate_interval])))` | `timeseries` | Other Request P99 per Instance. | `prometheus` | `s` | `[{{instance}}]-[{{pod}}]-[{{scheme}}]-[{{operation}}]` |
| Other Request P99 per Instance | `histogram_quantile(0.99, sum by(instance, pod, le, scheme, operation) (rate(opendal_operation_duration_seconds_bucket{ operation!~"read\|write\|list\|Writer::write\|Writer::close\|Reader::read"}[$__rate_interval])))` | `timeseries` | Other Request P99 per Instance. | `prometheus` | `s` | `[{{instance}}]-[{{pod}}]-[{{scheme}}]-[{{operation}}]` |
| Opendal traffic | `sum by(instance, pod, scheme, operation) (rate(opendal_operation_bytes_sum{}[$__rate_interval]))` | `timeseries` | Total traffic as in bytes by instance and operation | `prometheus` | `decbytes` | `[{{instance}}]-[{{pod}}]-[{{scheme}}]-[{{operation}}]` |
| Title | Query | Type | Description | Datasource | Unit | Legend Format |
| --- | --- | --- | --- | --- | --- | --- |
@@ -102,6 +110,8 @@
| Meta KV Ops Latency | `histogram_quantile(0.99, sum by(pod, le, op, target) (greptime_meta_kv_request_elapsed_bucket))` | `timeseries` | P99 latency of metasrv KV operations, by operation and target. | `prometheus` | `s` | `{{pod}}-{{op}} p99` |
| Rate of meta KV Ops | `rate(greptime_meta_kv_request_elapsed_count[$__rate_interval])` | `timeseries` | Rate of metasrv KV operations. | `prometheus` | `none` | `{{pod}}-{{op}} p99` |
| DDL Latency | `histogram_quantile(0.9, sum by(le, pod, step) (greptime_meta_procedure_create_tables_bucket))`<br/>`histogram_quantile(0.9, sum by(le, pod, step) (greptime_meta_procedure_create_table))`<br/>`histogram_quantile(0.9, sum by(le, pod, step) (greptime_meta_procedure_create_view))`<br/>`histogram_quantile(0.9, sum by(le, pod, step) (greptime_meta_procedure_create_flow))`<br/>`histogram_quantile(0.9, sum by(le, pod, step) (greptime_meta_procedure_drop_table))`<br/>`histogram_quantile(0.9, sum by(le, pod, step) (greptime_meta_procedure_alter_table))` | `timeseries` | P90 latency of metasrv DDL procedures (create/drop/alter), by step. | `prometheus` | `s` | `CreateLogicalTables-{{step}} p90` |
- expr: histogram_quantile(0.99, sum by(instance, pod, le, scheme) (rate(opendal_operation_duration_seconds_bucket{operation="read"}[$__rate_interval])))
- expr: histogram_quantile(0.99, sum by(instance, pod, le, scheme, operation) (rate(opendal_operation_duration_seconds_bucket{operation=~"read|Reader::read"}[$__rate_interval])))
- expr: histogram_quantile(0.99, sum by(instance, pod, le, scheme) (rate(opendal_operation_duration_seconds_bucket{ operation="write"}[$__rate_interval])))
- expr: histogram_quantile(0.99, sum by(instance, pod, le, scheme, operation) (rate(opendal_operation_duration_seconds_bucket{ operation =~ "Writer::write|Writer::close|write"}[$__rate_interval])))
- expr: histogram_quantile(0.99, sum by(instance, pod, le, scheme, operation) (rate(opendal_operation_duration_seconds_bucket{ operation!~"read|write|list"}[$__rate_interval])))
- expr: histogram_quantile(0.99, sum by(instance, pod, le, scheme, operation) (rate(opendal_operation_duration_seconds_bucket{ operation!~"read|write|list|Writer::write|Writer::close|Reader::read"}[$__rate_interval])))