* refactor: pre-read the ingested sst file in object store to fill the local cache to accelerate first query
* feat: pre-download the ingested SST from remote to accelerate following reads
* resolve PR comments
* resolve PR comments
* feat(log_store): use new `Consumer`
* feat: add `from_peer_id`
* feat: read WAL entries respect index
* test: add test for `build_region_wal_index_iterator`
* fix: keep the handle
* fix: incorrect last index
* fix: replay last entry id may be greater than expected
* chore: remove unused code
* chore: apply suggestions from CR
* chore: rename `datanode_id` to `location_id`
* chore: rename `from_peer_id` to `location_id`
* chore: rename `from_peer_id` to `location_id`
* chore: apply suggestions from CR
* use list_with_metakey and concurrent_stat_in_list
* change concurrent in recover_cache like before
* remove stat function
* use 8 concurrent
* use const value
* fmt code
* Apply suggestions from code review
---------
Co-authored-by: ozewr <l19ht@google.com>
Co-authored-by: Weny Xu <wenymedia@gmail.com>
* chore: improve pipeline performance
* chore: use arc to improve time type
* chore: improve pipeline coerce
* chore: add vec refactor
* chore: add vec pp
* chore: improve pipeline
* in progress
* chore: set log ingester use new pipeline
* chore: fix some error by pr comment
* chore: fix typo
* chore: use enum_dispatch to simplify code
* chore: some minor fix
* chore: format code
* chore: update by pr comment
* chore: fix typo
* chore: make clippy happy
* chore: fix by pr comment
* chore: remove epoch and date processors, add new timestamp processor
* chore: add more test for pipeline
* chore: restore epoch and date processor
* chore: fix compatibility issue
* chore: fix by pr comment
* chore: move the evaluation out of the loop
* chore: fix by pr comment
* chore: fix dissect output key filter
* chore: fix ordering error in transform output greptime values
* chore: keep pipeline transform output order
* chore: revert tests
* chore: simplify pipeline prepare implementation
* chore: add test for timestamp pipeline processor
* chore: make clippy happy
* chore: replace is_some check with match
---------
Co-authored-by: shuiyisong <xixing.sys@gmail.com>
* feat: support fast count(*) for append-only tables
* fix: total_rows stats in time series memtable
* fix: sqlness result changes for SinglePartitionScanner -> StreamScanAdapter
* fix: some cr comments
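A quick sketch of the append-only setup this optimization targets; the `append_mode` option key and the table definition are assumptions for illustration:

```sql
-- Hypothetical append-only table; count(*) can then be answered from
-- SST/memtable statistics instead of a full scan.
CREATE TABLE access_log (
  ts TIMESTAMP TIME INDEX,
  host STRING,
  message STRING
) WITH ('append_mode' = 'true');

SELECT count(*) FROM access_log;
```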
* Add file number limits to TWCS compaction
- Introduce `max_active_window_files` and `max_inactive_window_files` to `TwcsOptions`.
* feat/limit-files-in-windows: Add max active/inactive window files options to mito engine config
* feat/limit-files-in-windows: Add Debug derive to TwcsPicker and implement max file enforcement logging in TWCS compaction
* fix: clippy
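For context, a sketch of how these TWCS limits might be set as table options; the exact option keys and values are assumptions based on the commit text:

```sql
CREATE TABLE metrics (
  ts TIMESTAMP TIME INDEX,
  host STRING PRIMARY KEY,
  val DOUBLE
) WITH (
  'compaction.type' = 'twcs',
  -- limit files kept in the active and inactive time windows
  'compaction.twcs.max_active_window_files' = '4',
  'compaction.twcs.max_inactive_window_files' = '1'
);
```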
* feat: export all schemas and data at once
* feat: introduce export all to export schemas and data at once
* feat: default value for target
* feat: refactor export target
* chore: fix unit test
* feat: hint options for gRPC insert
* chore: unit test for extract_hints
* feat: add integration test for grpc hint
* test: add integration test for hints
* feat: add source channel to meter recorders
* feat: provide channel for query context
* fix: testing and extension get for query context
* chore: revert cargo toml structure changes
* fix: querycontext modification for prometheus and pipeline
* chore: switch git dependency to main branches
* chore: remove TODO
* refactor: rename other to unknown
---------
Co-authored-by: shuiyisong <113876041+shuiyisong@users.noreply.github.com>
* feat: refine logs for scan
* feat: improve build parts and unordered scan metrics
* feat: change to debug log
* fix: release lock before reading part
* test: replace region id
* test: fix sqlness
* chore: add todo
Co-authored-by: dennis zhuang <killme2008@gmail.com>
---------
Co-authored-by: dennis zhuang <killme2008@gmail.com>
* feat: support setting time range in Copy From statement
* test: add batch_filter_test
* fix: ts data type inconsistent error
* test: add sqlness test for copy from with statement
* fix: sqlness result error
* fix: cr comments
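A minimal sketch of the new time-range filter on `COPY FROM`; the `START_TIME`/`END_TIME` option names are hypothetical and used here only for illustration:

```sql
COPY demo FROM '/path/to/export/demo/'
WITH (
  FORMAT = 'parquet',
  -- hypothetical option names: only rows inside this range are imported
  START_TIME = '2024-01-01 00:00:00',
  END_TIME = '2024-02-01 00:00:00'
);
```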
* feat: show root cause on the error line
* feat: show root error for grpc
* feat: add error information for http error
* feat: add db information on error mysql/postgres logs
* feat: add function 'pg_catalog.pg_table_is_visible'
* feat: add 'pg_class' and 'pg_namespace', now we can run '\d' and '\dt'!
* refactor: move memory_table::tables to utils::tables
* refactor: move out predicate to system_schema to reuse it
* feat: predicates pushdown
* test: add pg_namespace, pg_class related sqlness test
* fix: typos and license header
* fix: sqlness test
* refactor: use `expect` instead of `unwrap` here
* refactor: remove the `information_schema::utils` mod
* doc: make the comment in pg_get_userbyid more precise
* doc: add TODO and comment in pg_catalog
* fix: typo
* fix: sqlness
* doc: change the comment on PGClassBuilder to a TODO
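With `pg_class` and `pg_namespace` in place, psql meta-commands such as `\d` and `\dt` work, and the catalog can also be queried directly. A small sketch; the column names simply mirror PostgreSQL's catalog and are assumptions here:

```sql
SELECT relname FROM pg_catalog.pg_class;
SELECT nspname FROM pg_catalog.pg_namespace;
```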
* Add dynamic cache size adjustment for InvertedIndexConfig
* Increase cache sizes in integration tests for HTTP
- Updated `metadata_cache_size` from 32MiB to 64MiB
* Remove cache size settings from config and update drop_lines_with_inconsistent_results function to handle them
* Add cache size configurations for inverted index metadata and content
- Introduced `metadata_cache_size` with a default of 64MiB.
- Introduced `content_cache_size` with a default of 128MiB.
* chore/index-content-cache-default-size: Add cache size configuration options for Mito engine's inverted index
* fix/reader-metrics:
Refactor cache hit/miss logic and update metrics in mito2
- Simplify cache retrieval logic in CacheManager by removing inline update_hit_miss function call.
- Add separate functions for incrementing cache hit and miss metrics.
- Update RowGroupLastRowCachedReader to use new cache hit/miss functions and refactor to new helper methods for creating Hit and Miss variants.
* chore: update grafana dashboard to reflect recent metric changes
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
* chore: add a blank line at the end
---------
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Co-authored-by: dennis zhuang <killme2008@gmail.com>
* chore: add a compile cfg for python
* fix: feature gates are additive; turn off default features in workspace & add cfg in place
* chore: remove unused in different cfg
* fix: ensure keep alive is completed in time
* chore: apply suggestions from CR
* chore: use write runtime
* refactor: set META_LEASE_SECS to 5
* chore: set etcd replicas to 1
* chore: apply suggestions from CR
* chore: apply suggestions from CR
* fix: set `MissedTickBehavior::Delay`
* chore: apply suggestions from CR
* Add caching for last row reader and expose cache manager
- Implement `RowGroupLastRowCachedReader` to handle cache hits and misses for last row reads.
* Add projection field to SelectorResultValue and refactor RowGroupLastRowReader
- Introduced `projection` field in `SelectorResultValue` to store projection indices.
* WIP: pg_catalog
* refactor: move memory_table to crate public level to reuse it in pgcatalog
* refactor: new system_schema mod to manage implementation of information_schema and pg_catalog
* feat: pg_catalog.pg_type
* fix: remove unused code to avoid warning
* test: add pg_catalog sqlness test
* feat: pg_catalog_cache in system_catalog
* fix: integration test
* test: rollback unit test
* refactor: mix pg_catalog table_id with old ones
* fix: add todo information
* tests: rerun sqlness
---------
Co-authored-by: johnsonlee <johnsonlee@localhost.localdomain>
* Add PruneReader for optimized row filtering and error handling
- Introduced `PruneReader` to replace `RowGroupReader` for optimized row filtering.
* Make ReaderMetrics fields public for external access
* Add row selection support to SeqScan and FileRange readers
- Updated `SeqScan::build_part_sources` to accept an optional `TimeSeriesRowSelector`.
* Refactor `scan_region.rs` to remove unnecessary cloning of `series_row_selector`. Enhance `file_range.rs` by adding `select_all` method to check if all rows in a row group are selected, and update the logic in `reader` method to use `LastRowReader` only when all rows are selected and no DELETE operations are present.
* Enhance PruneReader and ParquetReader with reset functionality and metrics handling
- Made Source enum public in prune.rs.
* chore: Update src/mito2/src/sst/parquet/reader.rs
---------
Co-authored-by: Yingwen <realevenyag@gmail.com>
* feat: support text/plain format of log input
* refactor: pipeline query and delete using dataframe api
* chore: minor refactor
* refactor: skip jsonify when processing plan/text
* refactor: support array(string) as pipeline engine input
* feat/copy-to-parquet-parameter: Enhance Parquet Writer with Column-wise Configuration
- Introduced column_wise_config function to customize per-column properties in Parquet writer.
* feat/copy-to-parquet-parameter: Enhance Parquet File Format Handling for Specific Data Types
- Added ConcreteDataType import to support specific data type handling.
* feat/copy-to-parquet-parameter: Refactor Parquet file format configuration
* feat/copy-to-parquet-parameter:
Enhance Parquet file format handling for timestamp columns
- Added logic to disable dictionary encoding and set DELTA_BINARY_PACKED encoding for timestamp columns in the Parquet file format configuration.
* feat/copy-to-parquet-parameter:
Disable dictionary encoding for timestamp columns in Parquet writer and update default max_active_window_runs in TwcsOptions
- Modified Parquet writer to disable dictionary encoding for timestamp columns to optimize for increasing timestamp data.
* feat/copy-to-parquet-parameter:
Update compaction settings in tests
- Modified `test_compaction_region` to include new compaction options: `compaction.type`,
`compaction.twcs.max_active_window_runs`, and `compaction.twcs.max_inactive_window_runs`.
- Updated `test_merge_mode_compaction` to use `compaction.twcs.max_active_window_runs` and
`compaction.twcs.max_inactive_window_runs` instead of `max_active_window_files` and
`max_inactive_window_files`.
* feat: use `Inserter` as Frontend
* fix: enable procedure in flownode
* docs: remove `frontend_addr` opts
* chore: rm fe addr in test runner
* refactor: int test also use inserter invoker
* feat: flow shutdown & refactor: remove `FrontendInvoker`
* refactor: rename `RemoteFrontendInvoker` to `FrontendInvoker`
* refactor: per review
* refactor: remove a layer of box
* fix: standalone use `node_manager`
* fix: remove a `Arc` cycle
* feat/inverted-index-cache:
Update dependencies and add caching for inverted index reader
- Updated `atomic` to 0.6.0 and `uuid` to 1.9.1 in `Cargo.lock`.
- Added `moka` and `uuid` dependencies in `Cargo.toml`.
- Introduced `seek_read` method in `InvertedIndexBlobReader` for common seek and read operations.
- Added `cache.rs` module to implement caching for inverted index reader using `moka`.
- Updated `async-compression` to 0.4.11 in `puffin/Cargo.toml`.
* feat/inverted-index-cache:
Refactor InvertedIndexReader and Add Index Cache Support
- Refactored `InvertedIndexReader` to include `seek_read` method and default implementations for `fst` and `bitmap`.
- Implemented `seek_read` in `InvertedIndexBlobReader` and `CachedInvertedIndexBlobReader`.
- Introduced `InvertedIndexCache` in `CacheManager` and `SstIndexApplier`.
- Updated `SstIndexApplierBuilder` to accept and utilize `InvertedIndexCache`.
- Added `From<FileId> for Uuid` implementation.
* feat/inverted-index-cache:
Update Cargo.toml and refactor SstIndexApplier
- Moved `uuid.workspace` entry in Cargo.toml for better organization.
* feat/inverted-index-cache:
Refactor InvertedIndexCache to use type alias for Arc
- Replaced `Arc<InvertedIndexCache>` with `InvertedIndexCacheRef` type alias.
* feat/inverted-index-cache:
Add Prometheus metrics and caching improvements for inverted index
- Introduced `prometheus` and `puffin` dependencies for metrics.
* feat/inverted-index-cache:
Refactor InvertedIndexReader and Cache handling
- Simplified `InvertedIndexReader` trait by removing seek-related comments.
* feat/inverted-index-cache:
Add configurable cache sizes for inverted index metadata and content
- Introduced `index_metadata_size` and `index_content_size` in `CacheManagerBuilder`.
* feat/inverted-index-cache:
Refactor and optimize inverted index caching
- Removed `metrics.rs` and integrated cache metrics into `index.rs`.
* feat/inverted-index-cache:
Remove unused dependencies from Cargo.lock and Cargo.toml
- Removed `moka`, `prometheus`, and `puffin` dependencies from both Cargo.lock and Cargo.toml.
* feat/inverted-index-cache:
Replace Uuid with FileId in CachedInvertedIndexBlobReader
- Updated `file_id` type from `Uuid` to `FileId` in `CachedInvertedIndexBlobReader` and related methods.
* feat/inverted-index-cache:
Refactor cache configuration for inverted index
- Moved `inverted_index_metadata_cache_size` and `inverted_index_cache_size` from `MitoConfig` to `InvertedIndexConfig`.
* feat/inverted-index-cache:
Remove unnecessary conversion of `file_id` in `SstIndexApplier`
- Simplified the initialization of `CachedInvertedIndexBlobReader` by removing the redundant `into()` conversion for `file_id`.
feat: flownode frontend client&test
feat: Frontend Client
feat: set frontend invoker for flownode
feat: set frontend invoker for flownode
chore: test script
WIP: test flow distributed
feat: hard coded demo
docs: flownode example toml
feat: add flownode support in runner
docs: comments for node
chore: after rebase
docs: add a todo
tests: move flow tests to common
fix: flownode sqlness dist test
chore: per review
docs: make
fix: make doc
* feat: add path prefix label to storage metrics
* refactor: return full path when the levels are less than 3
* refactor: align path label name with upstream
* refactor: better implementation of sub path
---------
Co-authored-by: Weny Xu <wenymedia@gmail.com>
* chore: update sqlness results
* refactor: use rwlock for modifiable data in session and querycontext
* chore: format toml
* refactor: use mutable_inner structure for mutable fields
* refactor: remove arc wrapper
* fix(fuzz): adapt for new partition rules
* feat: implement naive fuzz test for region migration
* chore(ci): add ci cfg
* chore: apply suggestions from CR
* chore: apply suggestions from CR
* fix: forbid to change tables in information_schema
* refactor: use unified read-only check function
* test: add more sqlness tests for information_schema
* refactor: move is_readonly_schema to common_catalog
* feat: add more placeholder field in information_schema.tables
* feat: make schema modifiable for use statement
* chore: add todo items
* fix: resolve lint issues after data type changes
* chore: update sqlness results
* refactor: patch for select database is no longer needed
* test: align tests and data types
* Apply suggestions from code review
Co-authored-by: dennis zhuang <killme2008@gmail.com>
* fix: use canonicalize_identifier for database name
* feat: add all columns for information_schema.tables
* test: remove variables from sqlness results
* feat: add to_string impl for table options
---------
Co-authored-by: dennis zhuang <killme2008@gmail.com>
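To illustrate the read-only `information_schema` and the extra columns now exposed by `information_schema.tables`, a small query sketch (the selected columns are standard ones and may differ from the exact set added here):

```sql
-- DDL and DML against information_schema are rejected; reads work as usual.
SELECT table_catalog, table_schema, table_name, table_type
FROM information_schema.tables
WHERE table_schema = 'public';
```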
* ci: use 'vault.centos.org' as default yum for centos:7 image
* ci: fix cargo-binstall version to adapt rust toolchain
* ci: specify cargo-binstall version to adapt current rust toolchain
* refactor: add Compactor trait
* chore: add compact() in Compactor trait and expose compaction module
* refactor: add CompactionRequest and open_compaction_region
* refactor: export the compaction api
* refactor: add DefaultCompactor::new_from_request
* refactor: no need to pass mito_config in open_compaction_region()
* refactor: CompactionRequest -> &CompactionRequest
* fix: typo
* docs: add docs for public apis
* refactor: remove 'Picker' from Compactor
* chore: add logs
* chore: change pub attribute for Picker
* refactor: remove do_merge_ssts()
* refactor: update comments
* refactor: use CompactionRegion argument in Picker
* chore: make compaction module public and remove unnecessary clone
* refactor: move build_compaction_task() in CompactionScheduler{}
* chore: use in open_compaction_region() and add some comments for public structure
* refactor: add 'manifest_dir()' in store-api
* refactor: move the default implementation to DefaultCompactor
* refactor: remove Options from MergeOutput
* chore: minor modification
* fix: clippy errors
* fix: unit test errors
* refactor: remove 'manifest_dir()' from store-api crate(already have one in opener)
* refactor: use 'region_dir' in CompactionRequest
* refactor: refine naming
* refactor: refine naming
* refactor: remove clone()
* chore: add comments
* refactor: add PickerOutput field in CompactorRequest
* feat: introduce RemoteJobScheduler
* feat: add RemoteJobScheduler in schedule_compaction_request()
* refactor: use Option type for senders field of CompactionFinished
* refactor: modify CompactionJob
* refactor: schedule remote compaction job by options
* refactor: remove unused Options
* build: remove unused log
* refactor: fallback to local compaction if the remote compaction failed
* fix: clippy errors
* refactor: add plugins in mito2
* refactor: add from_u64() for JobId
* refactor: make schedule module public
* refactor: add error for RemoteJobScheduler
* refactor: add Notifier
* refactor: use Arc for Notifier
* refactor: add 'remote_compaction' in compaction options
* fix: clippy errors
* fix: unrecognized table option
* refactor: add 'start_time' in CompactionJob
* refactor: modify error type of RemoteJobScheduler
* chore: revert changes for request
* refactor: code refactor by review comment
* refactor: use string type for JobId
* refactor: add 'waiters' field in DefaultNotifier
* fix: build error
* refactor: take coderabbit's review comment
* refactor: use uuid::Uuid as JobId
* refactor: return waiters when schedule failed and add on_failure for DefaultNotifier
* refactor: move waiters from notifier to Job
* refactor: use ObjectStoreManagerRef in open_compaction_region()
* refactor: implement for JobId and add related unit tests
* fix: failing unit tests
* refactor: add RemoteJobSchedulerError
* fix: make InfluxDB lines able to be inserted into the last created tables
* Update src/servers/src/influxdb.rs
* add an option to control the time index alignment behavior
* fix ci
* refactor: use interceptor to handle timestamp align
* Apply suggestions from code review
Co-authored-by: dennis zhuang <killme2008@gmail.com>
---------
Co-authored-by: tison <wander4096@gmail.com>
Co-authored-by: dennis zhuang <killme2008@gmail.com>
* feat: Use DATANODE_LEASE_SECS from distributed_time_constants for heartbeat pause duration
* feat: introduce `RegionFailureDetectorController` to manage region failure detectors
* feat: add `RegionFailureDetectorController` to `DdlContext`
* feat: add `region_failure_detector_controller` to `Context` in region migration
* feat: register region failure detectors during rollback region migration procedure
* feat: deregister region failure detectors during drop table procedure
* feat: register region failure detectors during create table procedure
* fix: update meta config
* chore: apply suggestions from CR
* chore: avoid cloning
* chore: rename
* chore: reduce the size of the test
* chore: apply suggestions from CR
* chore: move channel initialization into `RegionSupervisor::channel`
* chore: minor refactor
* chore: rename ident
* fix: add serialize_ignore_column_ids() to fix failure to deserialize region options from a JSON string
* refactor: return empty vector if column_id is empty
* feat: add functions to find and merge sorted runs
* chore: refactor code
* chore: remove some duplicates
* chore: remove one clone
* refactor: change max_active_window_files to max_active_window_runs
* feat: integrate with sorted runs
* fix: unit tests
* feat: limit num of sorted runs during compaction
* fix: some test
* fix: some cr comments
* feat: use smallvec
* chore: rebase main
* feat/reduce-sorted-runs:
Refactor compaction logic and update test configurations
- Refactored `merge_all_runs` function to use `sort_ranged_items` for sorting.
- Improved item merging logic by iterating with `into_iter` and handling overlaps.
- Updated test configurations to use `max_active_window_runs` instead of `max_active_window_files` for consistency.
---------
Co-authored-by: tison <wander4096@gmail.com>
* test: add e2e test for region failover
* chore: add ci cfg
* chore: reduce parallelism to 8
* fix(ci): enable region failure
* chore: set sqlx LogLevel to Off
* refactor: move help functions to utils
* feat: add update_mode to region options
* test: add test
* feat: last not null iter
* feat: time series last not null
* feat: partition tree update mode
* feat: partition tree
* fix: last not null iter slice
* test: add test for compaction
* test: use second resolution
* style: fix clippy
* chore: merge two lines
Co-authored-by: Jeremyhi <jiachun_feng@proton.me>
* chore: address CR comments
* refactor: UpdateMode -> MergeMode
* refactor: LastNotNull -> LastNonNull
* chore: return None earlier
* feat: validate region options
make merge mode optional and use the default when it is None
* test: fix tests
---------
Co-authored-by: Jeremyhi <jiachun_feng@proton.me>
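A sketch of the resulting table option; the `merge_mode` key and `last_non_null` value follow the renaming in these commits, while the table definition itself is an assumption:

```sql
CREATE TABLE sensor (
  ts TIMESTAMP TIME INDEX,
  host STRING PRIMARY KEY,
  cpu DOUBLE,
  mem DOUBLE
) WITH ('merge_mode' = 'last_non_null');
-- With last_non_null, a NULL in a newer row does not overwrite an older
-- non-null value for the same primary key and timestamp.
```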
* refactor: remove compaction_options and use RegionOptions type for region_options
* refactor: add file_purger field in CompactionRegion
* refactor: add SerializedPickerOutput
* refactor: rename CompactorRequest to OpenCompactionRegionRequest and remove PickerOutput
* refactor: use &PickerOutput instead of clone()
* refactor: migrate region failover implementation to region migration
* fix: use HEARTBEAT_INTERVAL_MILLIS as lease secs
* fix: return false if leader is downgraded
* fix: only remove failure detector after submitting procedure successfully
* feat: ignore dropped region
* refactor: retrieve table routes in batches
* refactor: disable region failover on local WAL implementation
* fix: move the guard into procedure
* feat: use real peer addr
* feat: use interval instead of sleep
* chore: rename `HeartbeatSender` to `HeartbeatAcceptor`
* chore: apply suggestions from CR
* chore: reduce duplicate code
* chore: apply suggestions from CR
* feat: lookup peer addr
* chore: add comments
* chore: apply suggestions from CR
* chore: apply suggestions from CR
* feat: introduce bulk memtable encoder/decoder
* chore: rebase main
* chore: resolve some comments
* refactor: only carries time unit in ArraysSorter
* fix: some comments
* feat: heartbeat task
* feat: use real flow peer allocator when building
* feat: add peer look up in ddl context
* fix: drop flow test
* refactor: per review(WIP)
* refactor: not check if is alive
* refactor: per review
* refactor: remove useless `reset`
* refactor: per bot advices
* refactor: alive peer
* chore: bot review
* fix: `region_peers` returns same region_id for multi logical tables
* test: add sqlness test for information_schema.region_peers
* refactor: region_peers sqlness
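The fix can be observed through the system table itself; a minimal sketch, with the column list being an assumption:

```sql
-- Each region of every logical table should now report a distinct region_id.
SELECT region_id, peer_id, is_leader, status
FROM information_schema.region_peers;
```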
* tests: flow sqlness tests
* tests: WIP df func test
* fix: use schema before expand for transform expr
* tests: some basic flow tests
* tests: unit test
* chore: dep use rev not patch
* fix: weird sqlness error?
* refactor: per review
* fix: temp sqlness bug
* fix: use fixed sqlness
* fix: impl drop as async shutdown
* refactor: per bot's review
* tests: drop worker handler both sync/async
* docs: add rationale for test
* refactor: per review
* chore: fmt
* chore: make RegionOptions serializable and add region_dir in CompactionRegion
* refactor: make `PickerOutput` and `MergeOutput` serializable and deserializable
* refactor: remove Serialize and Deserialize from PickerOutput
* chore: revert changes for file.rs
* chore: revert changes for compactor.rs and compaction.rs
---------
Co-authored-by: tison <wander4096@gmail.com>
* feat: prepare stmt in mysql client
* feat: execute stmt in mysql client
* fix: handle parameters properly
* refactor: use existing funcs to convert expr to scalar value
* refactor: use uuid strings as stmt_key for queries from COM_PREPARE packet
* refactor: take prepare and execute parser as submodule
* test: add unit test for converting expr to scalar value
* feat: deallocate stmt in mysql client
* chore: comments and duplicates
---------
Co-authored-by: dennis zhuang <killme2008@gmail.com>
* refactor: RangeBase
* feat: memtable range
* feat: scanner use mem range
* feat: remove base from mem range context
* feat: impl ranges for memtables
* chore: fix warnings
* refactor: make predicate cheap to clone
* refactor: MemRange -> MemtableRange
* feat: pub empty memtable to fix warnings
* test: fix sqlness result
* chore: call df function types
* feat: RelationDesc to DfSchema
* refactor: use RelationDesc instead of Type
* chore: WIP get to phy expr
* feat: custom deserialize
* chore: fmt
* refactor: renaming to DfScalarFunction
* feat: eval df func(untested)
* fix: had to spawn a thread for calling async
* chore: per review advices
* tests: test df scalar function
* refactor: add Compactor trait
* chore: add compact() in Compactor trait and expose compaction module
* refactor: add CompactionRequest and open_compaction_region
* refactor: export the compaction api
* refactor: add DefaultCompactor::new_from_request
* refactor: no need to pass mito_config in open_compaction_region()
* refactor: CompactionRequest -> &CompactionRequest
* fix: typo
* docs: add docs for public apis
* refactor: remove 'Picker' from Compactor
* chore: add logs
* chore: change pub attribute for Picker
* refactor: remove do_merge_ssts()
* refactor: update comments
* refactor: use CompactionRegion argument in Picker
* chore: make compaction module public and remove unnecessary clone
* refactor: move build_compaction_task() in CompactionScheduler{}
* chore: use in open_compaction_region() and add some comments for public structure
* refactor: add 'manifest_dir()' in store-api
* refactor: move the default implementation to DefaultCompactor
* refactor: remove Options from MergeOutput
* chore: minor modification
* fix: clippy errors
* fix: unit test errors
* refactor: remove 'manifest_dir()' from store-api crate(already have one in opener)
* refactor: use 'region_dir' in CompactionRequest
* refactor: refine naming
* refactor: refine naming
* refactor: remove clone()
* chore: add comments
* refactor: add PickerOutput field in CompactorRequest
* refactor: make individual col name optional
* chore: rename TypedPlan's `typ` to `schema`
* feat: add optional col name to typed plan
* feat: pass col name all along
* feat: correct infer output table schema
* chore: unused import
* fix: error when key is not projected
* refactor: per review
* chore: fmt
* ci: enable debug log
* chore: test to reproduce panic
* chore: Revert "ci: enable debug log"
This reverts commit 17eff2a045.
* test: add test for alter during flush
* fix: clear status if region has nothing to flush
It also executes pending DDLs and requests
* docs: fix typo
* feat: support cancellation
* chore: add unit test for cancellation
* chore: minor refactor
* feat: we do not need to spawn in distributed mode
---------
Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
* set global runtime size
* fix: resolve PR comments
* fix: log the whole option
* fix ci
* debug ci
* debug ci
---------
Co-authored-by: Weny Xu <wenymedia@gmail.com>
* fix: mfp missing rows if run twice in same tick
* tests: run mfp for multiple times
* refactor: make mfp less hacky
* feat: make channel larger
* chore: typos
* fix: display the PartitionBound and PartitionDef correctly
* Update src/partition/src/partition.rs
Co-authored-by: dennis zhuang <killme2008@gmail.com>
* fix: fix unit test of partition definition
---------
Co-authored-by: dennis zhuang <killme2008@gmail.com>
* feat: implement the `WalEntryDistributor` and `WalEntryReceiver`
* test: add tests for `WalEntryDistributor`
* refactor: use bounded channel
* chore: apply suggestions from CR
* feat: unordered scanner
* feat: support compat
* chore: update debug print
fix: missing ranges in scan parts
* fix: ensure chunk size > 0
* fix: parallel is disabled if there is only one file and memtable
* chore: reader metrics
* chore: remove todo
* refactor: add ScanPartBuilder trait
* chore: pass file meta to the part builder
* chore: make part builder private
* docs: update comment
* chore: remove meta()
* refactor: only prune file ranges in ScanInput
Replaces ScanPartBuilder with FileRangeCollector, which only collects file ranges.
* chore: address typo
* fix: panic when no partition
* feat: Postpone part distribution
* chore: handle empty partition in mito
* style: fix clippy
* show status returning empty contents
* return an empty set instead of affected rows
* chore: Update src/query/src/sql.rs
---------
Co-authored-by: Yingwen <realevenyag@gmail.com>
* feat: open region in background
* feat: trace opening regions
* feat: wait for the opening region
* feat: let engine to handle the future open request
* fix: fix `test_region_registering`
* feat: invoke `flush_table` and `compact_table` in fuzz tests
* feat: support to flush and compact physical metric table
* fix: avoid to create tables with the same name
* feat: validate values after flushing or compacting table
* add compaction udf params
* wip: pass compaction options through grpc
* wip: pass compaction options all the way down to region server
* wip: window compaction task
* feat: trigger major compaction
* refactor: optimize compaction parameter parsing
* chore: rebase main
* chore: update proto
* chore: add some tests
* feat: validate catalog
* chore: fix typo and rebase main
* fix: some cr comments
* fix: file_time_bucket_span
* fix: avoid upper bound overflow
* chore: update proto
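A sketch of triggering this compaction path through the admin functions; `flush_table` and `compact_table` are referenced earlier in this log, but the extra `compact_table` arguments (strategy name and window seconds) are assumptions for illustration:

```sql
SELECT flush_table('monitor');
-- Hypothetical: request a strict-window (major) compaction with a 1-hour window.
SELECT compact_table('monitor', 'strict_window', '3600');
```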
* feat: convert timestamp range filters to predicates
* chore: rebase main
* fix: remove predicates once they have been added to timestamp filters to avoid duplicate filtering
* fix: some comments
* fix: resolve conflicts
* feat: enable gzip in grpc server side
* feat: add enable_gzip_compression config
* test: add grpc compression test
* feat: support user configured compression on grpc server
* chore: update doc
* chore: add tests
* fix: make config-docs
* chore: fix cr issue
* chore: add test
* refactor: remove config on server side, auto enable all compression support
* chore: minor update
* chore: remove unused code
* refactor: enable zstd compression internally by default
* chore: minor fix
* chore: change binary array type from LargeBinaryArray to BinaryArray
* fix: adjust try_into_vector logic
* fix: apply CR suggestions, add tests
* chore: fix failing test
* chore: fix integration test
* chore: adjust the assertions according to changed implementation
* chore: add a test with LargeBinary type
* chore: apply CR suggestions
* chore: simplify tests
* feat(WIP): tumble window rewrite parser
* tests: tumble func
* feat: add `update_at` column for all flow output
* chore: cleanup per review
* fix: update_at not as time index
* fix: demo tumble
* fix: tests&tumble signature&accept both ts&datetime
* refactor: update_at now ts millis type
* chore: per review advices
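A sketch of a flow using the tumble window and the automatically appended `update_at` column; the flow name, table names, and the exact `tumble` signature are assumptions:

```sql
CREATE FLOW requests_per_hour
SINK TO requests_hourly
AS
SELECT count(*) AS total, tumble(ts, '1 hour') AS time_window
FROM requests
GROUP BY tumble(ts, '1 hour');
-- The sink table additionally receives an update_at timestamp column
-- (millisecond precision) maintained by the flow engine.
```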
* feat(flow): flow node manager
feat(flow): render src/sink
feat(flow): flow node manager in standalone
fix?: higher run freq
chore: remove redundant error enum variant
fix: run with higher freq if insert more
chore: fix after rebase
chore: typos
* chore(WIP): per review
* chore: per review
* ci: disable other test
* ci: timeout 30
* ci: try to use lld
* ci: change linker
* test: wait for file change in test multiple times
* ci: enable other tests
* chore: revert sleep in loop
* feat(fuzz): add validator for inserted rows
* fix: compatibility with mysql types
* feat(fuzz): add datetime and date type in mysql for row validator
* ci: use windows 2019
* test: ignore cleanup result
* chore: revert change
* test: unstable repeated task test
* build: update rust toolchain and windows
* ci: test sqlness
* chore: enable other tests
* feat: support to create & drop flow via grpc
* chore: apply suggestions from CR
* chore: apply suggestions from CR
* chore: apply suggestions from CR
* refactor: passing QueryContext to RegionServer
* refactor: change the return type of build() in QueryContextBuilder
* fix: update greptime-proto reference
* chore: apply suggestion
* chore: revert the last commit
---------
Co-authored-by: dennis zhuang <killme2008@gmail.com>
* feat: mirror insert req to flow node
* refactor: group_requests_by_peer
* chore: rename `nodes` to `flows` to be more apt
* docs: add TODO
* refactor: split flow&data node grouping to two func
* refactor: mirror_flow_node_request
* chore: add some TODOs
* refactor: use Option in value
* feat: skip non-src table quickly
* docs: add TODO for `Peer.address`
* fix: dedup
* refactor: let upper caller control whether to omit column list
* feat(fuzz): add insert logical table target
* ci: add fuzz_insert_logical_table ci cfg
* feat(fuzz): add create logical table target
* fix: drop physical table after fuzz test
* fix: remove backticks of table name in with clause
* fix: create physical and logical table properly
* chore: update comments
* chore(ci): add fuzz_create_logical_table ci cfg
* fix: create one logical table at a time
* fix: avoid possible duplicate table and column name
* feat: use hard-code physical table
* chore: remove useless phantom
* refactor: create logical table with struct initialization
* chore: suggested changes and corresponding test changes
* chore: clean up
* fix: post process result on query full column name of prom labels API
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
* only preserve tag column
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
---------
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
* feat: support different types for `CompatReader`
* chore: only compare whether we need: (data_type)
* fix: optimize code based on review suggestions
- add unit test `test_safe_cast_to_null` to test safely cast
- add DataType to projected_fields
- remove TODO
* fix: assert_eq fail on `projection.rs`
* style: codefmt
* style: fix the code based on review suggestions
* fix: do not remove deletion markers when window time range overlaps
* chore: fix some minor issues; add compaction test
* chore: add more test
* fix: nitpick master's nitpick
* feat: support invalidate schema name key cache
* fix: remove pub for invalidate_schema_cache
* refactor: add DropMetadataBroadcast State Op
* fix: delete files
* feat: transform substrait SELECT & WHERE & GROUP BY to Flow Plan
* chore: reexport from common/substrait
* feat: use datafusion Aggr Func to map to Flow aggr func
* chore: remove unwrap&split literal
* refactor: split transform.rs into smaller files
* feat: apply optimize for variadic fn
* refactor: split unit test
* chore: per review
* fix: cli export `create table` with quoted names
* add test
* apply review comments
* fix to pass check
* remove eprintln for clippy check
* use prebuilt binary to avoid compile
* ci run coverage after build
* drop dirty hack test
Signed-off-by: tison <wander4096@gmail.com>
---------
Signed-off-by: tison <wander4096@gmail.com>
Co-authored-by: tison <wander4096@gmail.com>
* refactor: func's specialization& use Error not EvalError
* docs: some pub item
* chore: typo
* docs: add comments for every pub item
* chore: per review
* chore: per review & derive Copy
* chore: per review&test for binary fn spec
* docs: comment explain how binary func spec works
* chore: minor style change
* fix: Error not EvalError
* chore: keep the same method order in KvBackend
* feat: make meta client able to get all node info of the cluster
* feat: cluster info data model
* feat: frontend and datanode info
* feat: list node info
* chore: remove the method: is_started
* fix: scan key prefix
* chore: impl From for NodeInfoKey
* chore: doc for trait and struct
* chore: reuse the error
* chore: refactor two collect-cluster-info handlers
* chore: remove inline
* chore: refactor two collect-cluster-info handlers
* fix: move object store read/write timer into inner
* add Drop for PrometheusMetricWrapper
* call await on async read/write
* apply review comments
* get rid of option on timer
* test: add integration_test for datetime style
* feat: support various datestyle for postgres
* doc: rewrite the comment about merge_datestyle_value
* test: add more test to illustrate valid datestyle input
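A minimal sketch of the session variable this adds, following PostgreSQL's DateStyle semantics:

```sql
-- Controls how dates are rendered over the PostgreSQL protocol.
SET DATESTYLE = 'ISO, MDY';
```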
* feat: add __schema__ tag for promql parser
* feat: disable matcher op other than equals
* test: add more test to ensure context getting reset
* test: add integration test
* test: refactor tests
* refactor: remove duplicated test code
* refactor: update according to review comments
* test: add sqlness test for cross schema scenario
---------
Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
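A sketch of the `__schema__` matcher in a PromQL selector, here wrapped in TQL; only the equality matcher is accepted, and the metric and schema names are assumptions:

```sql
TQL EVAL (0, 100, '15s') http_requests_total{__schema__="public"};
```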
* feat(tql): add initial support for start,stop,step as sql functions
* fix(tql): remove unwraps, adjust fmt
* fix(tql): address taplo issue
* feat(tql): update parse_tql_query logic
* fix(tql): change query parsing logic to use parser instead of delimiter
* fix(tql): add timestamp function support, add sqlness tests
* fix(tql): add lookback optional param for tql eval
* fix(tql): adjust tests for now() function
* fix(tql): introduce the tqlerror to differentiate failures on parsing, evaluation and simplification stages
* fix(tql): add tests for explain/analyze
* feat(tql): add lookback support for explain/analyze, update tests
* feat(tql): add more sqlness tests
* chore(tql): extract common logic for eval, analyze and explain into a single function
* feat(tql): address CR points
* feat(tql): use snafu for tql errors, add more docs
* feat(tql): address CR points
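A sketch of the extended TQL syntax: start, stop, and step may now be SQL expressions such as `now()`, and an optional lookback argument is accepted; treat the exact argument order and literal forms as assumptions:

```sql
TQL EVAL (now() - INTERVAL '1 hour', now(), '30s', '5m')
  rate(http_requests_total[1m]);
```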
* test: add integration tests for kafka wal
* chore: rebase main
* chore: unify naming convention for wal config
* chore: add register loaders switch
* chore: alter tables by adding a new column
* chore: move rand to dev-dependencies
* chore: update Cargo.lock
* feat: support set variable statement of session
* feat: support printing PostgreSQL's bytea data type in its "hex" and "escape" formats, in an ugly way
* refactor: add 'SessionConfigValue' type and unify the name
* doc: add license header
* refactor: confine coupling with 'sql::ast::Value' in SessionConfigValue
* refactor: move all bytea wrapper into bytea.rs
* fix: remove unused import in context.rs and postgres.rs
* refactor: rename 'set_configuration_parameter' to 'set_session_config'
rename 'set_configuration_parameter' in statement_.rs to 'set_session_config'
* refactor: use mod to organize options via macro
* refactor: re-model the session config value with static type
* test: add integration test
* refactor: move the encode bytea by format type logic into encoder
refactor: use Arc<DashMap> instead of DashMap in QueryContext
refactor: use Arc<DashMap> instead of DashMap in QueryContext
Avoid expensive clone
refactor: use unreachable!() instead of unimplemented!()
refactor: move the encode bytea by format type logic into encoder
test: add binary format integration test case
* test: add ut for byte related type
* doc: remove TODO of bytea_output
* refactor: simplify the implementation with simple struct instead of complex typing
* fix: typo of 'Available'
* fix compile
Signed-off-by: tison <wander4096@gmail.com>
---------
Signed-off-by: tison <wander4096@gmail.com>
Co-authored-by: tison <wander4096@gmail.com>
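A sketch of the session config this wires up; `bytea_output` follows PostgreSQL's 'hex'/'escape' semantics when binary results are returned over the PostgreSQL protocol:

```sql
SET bytea_output = 'escape';
```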
* feat: Able to pretty print sql query result in http output
* fix: add some tests
* fix: add some space, delete fn into_payload, and impl Display for TableResponse
* feat: Arrangement shared state
* feat: arrange&tests
* docs: detailed&tests for get
* chore: license
* refactor: opt out ts expr&tests: internal ts
* docs: remove some TODOs
* feat: use smallvec size of 2
* refactor: per review
* chore: per review
* chore: per review
* chore: remove redundant clone
* feat: return max expire time & docs: better explain current expire config
* feat: add memtable builder to region
* refactor: rename memtable_builder in worker to default_memtable_builder
* fix: return error instead of using default compaction options
Support deserializing memtable and compaction options from the option map
* feat: optional memtable options
* feat: add MemtableBuilderProvider to create builders
* feat: change default memtable and skip deserializing dedup
* chore: update test and comment
* chore: test invalid type
* feat: metric engine uses new memtable manually
* feat: expose more memtable configs
* feat: add memtable options to valid option list
* test: add test
* test: sqlness test
* chore: serde workspace
* chore: remove comments
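A sketch of selecting the memtable through table options; the `memtable.type` key and the `partition_tree` value are assumptions inferred from the commit list:

```sql
CREATE TABLE events (
  ts TIMESTAMP TIME INDEX,
  host STRING PRIMARY KEY,
  val DOUBLE
) WITH ('memtable.type' = 'partition_tree');
```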
* feat: handle flush periodically
* chore: call periodical method in loop
* feat: check periodical tasks on channel timeout
* refactor: use time provider to get time
Mock a time provider to test auto flush
* chore: fix typos
* refactor: rename mock time provider
* style: fix clippy
* chore: address comment
* feat: acquire catalog and schema lock in region failover
* chore: remove unused code
* feat!: acquire catalog and schema lock in region migration
* feat: acquire catalog and schema lock in create table
* feat: call freeze if the active data buffer in a shard is full
* chore: more metrics
* chore: print metrics
* chore: enlarge freeze threshold
* test: test freeze
* test: fix config test
* feat(influxdb): add db query param support for v2 write api
* fix(influxdb): update authorize logic to get catalog and schema from query string
* fix(influxdb): address CR suggestions
* fix(influxdb): use the correct import
* feat: add create table fuzz test
* chore: add ci cfg for fuzz tests
* refactor: remove redundant nightly config
* chore: run fuzz test in debug mode
* chore: use ubuntu-latest
* fix: close connection
* chore: add cache in fuzz test ci
* chore: apply suggestion from CR
* chore: apply suggestion from CR
* chore: refactor the fuzz test action
release-dev-builder-images-cn: # Note: Be careful with issue https://github.com/containers/skopeo/issues/1874; we decided to use the latest stable skopeo container.
Thanks a lot for considering contributing to GreptimeDB. We believe people like you will make GreptimeDB a great product. We intend to build a community where individuals can have open talks, show respect for one another, and speak with true ❤️. Meanwhile, we aim to keep transparency and make your effort count here.
You can find our contributors at https://github.com/GreptimeTeam/greptimedb/graphs/contributors. When you dedicate yourself to GreptimeDB for a few months and keep bringing high-quality contributions (code, docs, advocacy, etc.), you will be a candidate to become a committer.
A committer will be granted both read & write access to GreptimeDB repos. Check the [AUTHOR.md](AUTHOR.md) file for all current individual committers.
Please read the guidelines, and they can help you get started. Communicate respectfully with the developers maintaining and developing the project. In return, they should reciprocate that respect by addressing your issue, reviewing changes, as well as helping finalize and merge your pull requests.
Follow our [README](https://github.com/GreptimeTeam/greptimedb#readme) to get the whole picture of the project. To learn about the design of GreptimeDB, please refer to the [design docs](https://github.com/GrepTimeTeam/docs).
It can feel intimidating to contribute to a complex project, but it can also be exciting and fun. These general notes will help everyone participate in this communal activity.
- Follow the [Code of Conduct](https://github.com/GreptimeTeam/.github/blob/main/.github/CODE_OF_CONDUCT.md)
- Small changes make huge differences. We will happily accept a PR making a single character change if it helps move forward. Don't wait to have everything working.
- Check the closed issues before opening your issue.
- Try to follow the existing style of the code.
Pull requests are great, but we accept all kinds of other help if you like.
## Code of Conduct
Also, there are things that we are not looking for because they don't match the goals of the product or benefit the community. Please read [Code of Conduct](https://github.com/GreptimeTeam/.github/blob/main/.github/CODE_OF_CONDUCT.md); we hope everyone can keep good manners and become an honored member.
## License
- To ensure that the community is free and confident in its ability to use your contributions, please sign the Contributor License Agreement (CLA), which will be incorporated in the pull request process.
- Make sure all files have proper license header (running `docker run --rm -v $(pwd):/github/workspace ghcr.io/korandoru/hawkeye-native:v3 format` from the project root).
- Make sure all your code is formatted and follows the [coding style](https://pingcap.github.io/style-guide/rust/) and [style guide](docs/style-guide.md).
- Make sure all unit tests pass using [nextest](https://nexte.st/index.html): `cargo nextest run`.
- Make sure all clippy warnings are fixed (you can check it locally by running `cargo clippy --workspace --all-targets -- -D warnings`).
GreptimeDB is an open-source time-series database focusing on efficiency, scalability, and analytical capabilities.
It's designed to work on infrastructure of the cloud era, and users benefit from its elasticity and commodity storage.
## Introduction
**GreptimeDB** is an open-source unified time-series database for **Metrics**, **Logs**, and **Events** (also **Traces** in plan). You can gain real-time insights from Edge to Cloud at any scale.
- Optimized columnar layout for handling time-series data; compacted, compressed, and stored on various storage backends, particularly cloud object storage with 50x cost efficiency.
- Fully open-source distributed cluster architecture that harnesses the power of cloud-native elastic computing resources.
- Seamless scalability from a standalone binary at edge to a robust, highly available distributed cluster in cloud, with a transparent experience for both developers and administrators.
- Native SQL and PromQL for queries, and Python scripting to facilitate complex analytical tasks.
- Flexible indexing capabilities and distributed, parallel-processing query engine, tackling high cardinality issues down.
- Widely adopted database protocols and APIs, including MySQL, PostgreSQL, and Prometheus Remote Storage, etc.
## Why GreptimeDB
## Quick Start
Our core developers have been building time-series data platforms for years. Based on our best-practices, GreptimeDB is born to give you:
GreptimeDB treats all time series as contextual events with timestamps, and thus unifies the processing of metrics, logs, and events. It supports analyzing metrics, logs, and events with SQL and PromQL, and streaming with continuous aggregation.
* **Cloud-Edge collaboration**
GreptimeDB can be deployed on ARM architecture-compatible Android/Linux systems as well as cloud environments from various vendors. Both sides run the same software, providing identical APIs and control planes, so your application can run at the edge or on the cloud without modification, and data synchronization also becomes extremely easy and efficient.
* **Cloud-native distributed database**
By leveraging object storage (S3 and others), separating compute and storage, scaling stateless compute nodes arbitrarily, GreptimeDB implements seamless scalability. It also supports cross-cloud deployment with a built-in unified data access layer over different object storages.
* **Performance and Cost-effective**
Flexible indexing capabilities and a distributed, parallel-processing query engine tackle high-cardinality issues. Optimized columnar layout for handling time-series data; compacted, compressed, and stored on various storage backends, particularly cloud object storage, with 50x cost efficiency.
* **Compatible with InfluxDB, Prometheus and more protocols**
Widely adopted database protocols and APIs, including MySQL, PostgreSQL, and Prometheus Remote Storage, etc. [Read more](https://docs.greptime.com/user-guide/clients/overview).
- C/C++ Toolchain: provides basic tools for compiling and linking. This is
available as `build-essential` on Ubuntu and under similar names on other platforms.
- Rust: the easiest way to install Rust is to use
[`rustup`](https://rustup.rs/), which will check our `rust-toolchain` file and
install the correct Rust version for you.
- Protobuf: `protoc` is required for compiling `.proto` files. `protobuf` is
available from major package managers on macOS and Linux distributions. You can
find installation instructions [here](https://grpc.io/docs/protoc-installation/).
**Note that `protoc` version needs to be >= 3.15** because we have used the `optional`
keyword. You can check it with `protoc --version`.
- python3-dev or python3-devel (optional, only needed if you want to run scripts
in CPython; you also need to enable the `pyo3_backend` feature when compiling, either with `cargo run -F pyo3_backend` or by adding `pyo3_backend` to `features.default` in src/script/Cargo.toml, e.g. `default = ["python", "pyo3_backend"]`): this installs the Python shared library required for running the Python
scripting engine in CPython mode. It is available as `python3-dev` on
Ubuntu (install it with `sudo apt install python3-dev`) or as
`python3-devel` on RPM-based distributions (e.g. Fedora, Red Hat, SuSE). macOS's
`Python3` package should include this shared library by default. More details on compiling with PyO3 can be found in [PyO3](https://pyo3.rs/v0.18.1/building_and_distribution#configuring-the-python-version)'s documentation.
To install GreptimeDB locally, the recommended way is via Docker:
#### Build with Docker
A docker image with necessary dependencies is provided:
* Python toolchain (optional): Required only if built with PyO3 backend. More detail for compiling with PyO3 can be found in its [documentation](https://pyo3.rs/v0.18.1/building_and_distribution#configuring-the-python-version).
Build GreptimeDB binary:
```shell
make
```
Run a standalone server:
```shell
cargo run -- standalone start
```
Or if you built from docker:
```
docker run -p 4002:4002 -v "$(pwd):/tmp/greptimedb" greptime/greptimedb standalone start
```
Please see the online document site for more installation options and [operations info](https://docs.greptime.com/user-guide/operations/overview).
### Get started
Read the [complete getting started guide](https://docs.greptime.com/getting-started/overview) on our [official document site](https://docs.greptime.com/).
To write and query data, GreptimeDB is compatible with multiple [protocols and clients](https://docs.greptime.com/user-guide/clients/overview).
For Linux and macOS, you can easily download pre-built binaries including official releases and nightly builds that are ready to use.
In most cases, downloading the version without PyO3 is sufficient. However, if you plan to run scripts in CPython (and use Python packages like NumPy and Pandas), you will need to download the version with PyO3 and install a Python interpreter whose version matches the one the PyO3 build was compiled against.
We recommend using virtualenv for the installation process to manage multiple Python versions.
Our official Grafana dashboard is available in the [grafana](grafana/README.md) directory.
## Project Status
The current version has not yet reached the standards for General Availability.
According to our Greptime 2024 Roadmap, we aim to achieve a production-level version with the release of v1.0 by the end of 2024. [Join Us](https://github.com/GreptimeTeam/greptimedb/issues/3412)
For future plans, check out [GreptimeDB roadmap](https://github.com/GreptimeTeam/greptimedb/issues/669).
We welcome you to test and use GreptimeDB. Some users have already adopted it in their production environments. If you're interested in trying it out, please use the latest stable release available.
## Community
If you have any questions or would like to get involved in our community, please check out:
- GreptimeDB Community on [Slack](https://greptime.com/slack)
- Greptime official [website](https://greptime.com)
In addition, you may:
- View our official [Blog](https://greptime.com/blogs/)
- Connect us with [Linkedin](https://www.linkedin.com/company/greptime/)
- Follow us on [Twitter](https://twitter.com/greptime)
## Commercial Support
If you are running GreptimeDB OSS in your organization, we offer additional enterprise add-ons, installation services, training, and consulting. [Contact us](https://greptime.com/contactus) and we will reach out to you with more details of our commercial license.
## License
GreptimeDB uses the [Apache License 2.0](https://apache.org/licenses/LICENSE-2.0.txt) to strike a balance between
open contributions and allowing you to use the software however you want.
## Contributing
Please refer to [contribution guidelines](CONTRIBUTING.md) and [internal concepts docs](https://docs.greptime.com/contributor-guide/overview.html) for more information.
## Acknowledgement
Special thanks to all the contributors who have propelled GreptimeDB forward. For a complete list of contributors, please refer to [AUTHOR.md](AUTHOR.md).
- GreptimeDB uses [Apache Arrow™](https://arrow.apache.org/) as the memory model and [Apache Parquet™](https://parquet.apache.org/) as the persistent file format.
- GreptimeDB's query engine is powered by [Apache Arrow DataFusion™](https://arrow.apache.org/datafusion/).
- [Apache OpenDAL™](https://opendal.apache.org) gives GreptimeDB a very general and elegant data access abstraction layer.
| `grpc.tls.watch` | Bool | `false` | Watch for Certificate and key file change and auto reload.<br/>For now, gRPC tls config does not support auto reload. |
| `prom_store.enable` | Bool | `true` | Whether to enable Prometheus remote write and read in HTTP API. |
| `prom_store.with_metric_engine` | Bool | `true` | Whether to store the data from Prometheus remote write in metric engine. |
| `wal` | -- | -- | The WAL options. |
| `wal.provider` | String | `raft_engine` | The provider of the WAL.<br/>- `raft_engine`: the wal is stored in the local file system by raft-engine.<br/>- `kafka`: it's remote wal that data is stored in Kafka. |
| `wal.dir` | String | `None` | The directory to store the WAL files.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.file_size` | String | `256MB` | The size of the WAL segment file.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.purge_threshold` | String | `4GB` | The threshold of the WAL size to trigger a flush.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.purge_interval` | String | `10m` | The interval to trigger a flush.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.read_batch_size` | Integer | `128` | The read batch size.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.sync_write` | Bool | `false` | Whether to use sync write.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.enable_log_recycle` | Bool | `true` | Whether to reuse logically truncated log files.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.prefill_log_files` | Bool | `false` | Whether to pre-create log files on start up.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.sync_period` | String | `10s` | Duration for fsyncing log files.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.broker_endpoints` | Array | -- | The Kafka broker endpoints.<br/>**It's only used when the provider is `kafka`**. |
| `wal.auto_create_topics` | Bool | `true` | Automatically create topics for WAL.<br/>Set to `true` to automatically create topics for WAL.<br/>Otherwise, use topics named `topic_name_prefix_[0..num_topics)` |
| `wal.num_topics` | Integer | `64` | Number of topics.<br/>**It's only used when the provider is `kafka`**. |
| `wal.selector_type` | String | `round_robin` | Topic selector type.<br/>Available selector types:<br/>- `round_robin` (default)<br/>**It's only used when the provider is `kafka`**. |
| `wal.topic_name_prefix` | String | `greptimedb_wal_topic` | A Kafka topic is constructed by concatenating `topic_name_prefix` and `topic_id`.<br/>e.g., greptimedb_wal_topic_0, greptimedb_wal_topic_1.<br/>**It's only used when the provider is `kafka`**. |
| `wal.replication_factor` | Integer | `1` | Expected number of replicas of each partition.<br/>**It's only used when the provider is `kafka`**. |
| `wal.create_topic_timeout` | String | `30s` | Above which a topic creation operation will be cancelled.<br/>**It's only used when the provider is `kafka`**. |
| `wal.max_batch_bytes` | String | `1MB` | The max size of a single producer batch.<br/>Warning: Kafka has a default limit of 1MB per message in a topic.<br/>**It's only used when the provider is `kafka`**. |
| `wal.consumer_wait_timeout` | String | `100ms` | The consumer wait timeout.<br/>**It's only used when the provider is `kafka`**. |
| `wal.backoff_init` | String | `500ms` | The initial backoff delay.<br/>**It's only used when the provider is `kafka`**. |
| `wal.backoff_max` | String | `10s` | The maximum backoff delay.<br/>**It's only used when the provider is `kafka`**. |
| `wal.backoff_base` | Integer | `2` | The exponential backoff rate, i.e. next backoff = base * current backoff.<br/>**It's only used when the provider is `kafka`**. |
| `wal.backoff_deadline` | String | `5mins` | The deadline of retries.<br/>**It's only used when the provider is `kafka`**. |
| `storage` | -- | -- | The data storage options. |
| `storage.data_home` | String | `/tmp/greptimedb/` | The working home directory. |
| `storage.type` | String | `File` | The storage type used to store the data.<br/>- `File`: the data is stored in the local file system.<br/>- `S3`: the data is stored in the S3 object storage.<br/>- `Gcs`: the data is stored in the Google Cloud Storage.<br/>- `Azblob`: the data is stored in the Azure Blob Storage.<br/>- `Oss`: the data is stored in the Aliyun OSS. |
| `storage.cache_path` | String | `None` | Cache configuration for object storage such as 'S3' etc.<br/>The local file cache directory. |
| `storage.cache_capacity` | String | `None` | The local file cache capacity in bytes. |
| `storage.bucket` | String | `None` | The S3 bucket name.<br/>**It's only used when the storage type is `S3`, `Oss` and `Gcs`**. |
| `storage.root` | String | `None` | The S3 data will be stored in the specified prefix, for example, `s3://${bucket}/${root}`.<br/>**It's only used when the storage type is `S3`, `Oss` and `Azblob`**. |
| `storage.access_key_id` | String | `None` | The access key id of the aws account.<br/>It's **highly recommended** to use AWS IAM roles instead of hardcoding the access key id and secret key.<br/>**It's only used when the storage type is `S3` and `Oss`**. |
| `storage.secret_access_key` | String | `None` | The secret access key of the aws account.<br/>It's **highly recommended** to use AWS IAM roles instead of hardcoding the access key id and secret key.<br/>**It's only used when the storage type is `S3`**. |
| `storage.access_key_secret` | String | `None` | The secret access key of the aliyun account.<br/>**It's only used when the storage type is `Oss`**. |
| `storage.account_name` | String | `None` | The account key of the azure account.<br/>**It's only used when the storage type is `Azblob`**. |
| `storage.account_key` | String | `None` | The account key of the azure account.<br/>**It's only used when the storage type is `Azblob`**. |
| `storage.scope` | String | `None` | The scope of the google cloud storage.<br/>**It's only used when the storage type is `Gcs`**. |
| `storage.credential_path` | String | `None` | The credential path of the google cloud storage.<br/>**It's only used when the storage type is `Gcs`**. |
| `storage.credential` | String | `None` | The credential of the google cloud storage.<br/>**It's only used when the storage type is `Gcs`**. |
| `storage.container` | String | `None` | The container of the azure account.<br/>**It's only used when the storage type is `Azblob`**. |
| `storage.sas_token` | String | `None` | The sas token of the azure account.<br/>**It's only used when the storage type is `Azblob`**. |
| `storage.endpoint` | String | `None` | The endpoint of the S3 service.<br/>**It's only used when the storage type is `S3`, `Oss`, `Gcs` and `Azblob`**. |
| `storage.region` | String | `None` | The region of the S3 service.<br/>**It's only used when the storage type is `S3`, `Oss`, `Gcs` and `Azblob`**. |
| `[[region_engine]]` | -- | -- | The region engine options. You can configure multiple region engines. |
| `region_engine.mito.num_workers` | Integer | `8` | Number of region workers. |
| `region_engine.mito.worker_channel_size` | Integer | `128` | Request channel size of each worker. |
| `region_engine.mito.worker_request_batch_size` | Integer | `64` | Max batch size for a worker to handle requests. |
| `region_engine.mito.manifest_checkpoint_distance` | Integer | `10` | Number of meta action updated to trigger a new checkpoint for the manifest. |
| `region_engine.mito.compress_manifest` | Bool | `false` | Whether to compress manifest and checkpoint file by gzip (default false). |
| `region_engine.mito.max_background_jobs` | Integer | `4` | Max number of running background jobs |
| `region_engine.mito.auto_flush_interval` | String | `1h` | Interval to auto flush a region if it has not flushed yet. |
| `region_engine.mito.global_write_buffer_size` | String | `1GB` | Global write buffer size for all regions. If not set, it's default to 1/8 of OS memory with a max limitation of 1GB. |
| `region_engine.mito.global_write_buffer_reject_size` | String | `2GB` | Global write buffer size threshold to reject write requests. If not set, it's default to 2 times of `global_write_buffer_size` |
| `region_engine.mito.sst_meta_cache_size` | String | `128MB` | Cache size for SST metadata. Setting it to 0 to disable the cache.<br/>If not set, it's default to 1/32 of OS memory with a max limitation of 128MB. |
| `region_engine.mito.vector_cache_size` | String | `512MB` | Cache size for vectors and arrow arrays. Setting it to 0 to disable the cache.<br/>If not set, it's default to 1/16 of OS memory with a max limitation of 512MB. |
| `region_engine.mito.page_cache_size` | String | `512MB` | Cache size for pages of SST row groups. Setting it to 0 to disable the cache.<br/>If not set, it's default to 1/8 of OS memory. |
| `region_engine.mito.selector_result_cache_size` | String | `512MB` | Cache size for time series selector (e.g. `last_value()`). Setting it to 0 to disable the cache.<br/>If not set, it's default to 1/16 of OS memory with a max limitation of 512MB. |
| `region_engine.mito.enable_experimental_write_cache` | Bool | `false` | Whether to enable the experimental write cache. |
| `region_engine.mito.experimental_write_cache_path` | String | `""` | File system path for write cache, defaults to `{data_home}/write_cache`. |
| `region_engine.mito.scan_parallelism` | Integer | `0` | Parallelism to scan a region (default: 1/4 of cpu cores).<br/>- `0`: using the default value (1/4 of cpu cores).<br/>- `1`: scan in current thread.<br/>- `n`: scan in parallelism n. |
| `region_engine.mito.parallel_scan_channel_size` | Integer | `32` | Capacity of the channel to send data from parallel scan tasks to the main task. |
| `region_engine.mito.allow_stale_entries` | Bool | `false` | Whether to allow stale WAL entries read during replay. |
| `region_engine.mito.index` | -- | -- | The options for index in Mito engine. |
| `region_engine.mito.index.aux_path` | String | `""` | Auxiliary directory path for the index in filesystem, used to store intermediate files for<br/>creating the index and staging files for searching the index, defaults to `{data_home}/index_intermediate`.<br/>The default name for this directory is `index_intermediate` for backward compatibility.<br/><br/>This path contains two subdirectories:<br/>- `__intm`: for storing intermediate files used during creating index.<br/>- `staging`: for storing staging files used during searching index. |
| `region_engine.mito.index.staging_size` | String | `2GB` | The max capacity of the staging directory. |
| `region_engine.mito.inverted_index` | -- | -- | The options for inverted index in Mito engine. |
| `region_engine.mito.inverted_index.create_on_flush` | String | `auto` | Whether to create the index on flush.<br/>- `auto`: automatically (default)<br/>- `disable`: never |
| `region_engine.mito.inverted_index.create_on_compaction` | String | `auto` | Whether to create the index on compaction.<br/>- `auto`: automatically (default)<br/>- `disable`: never |
| `region_engine.mito.inverted_index.apply_on_query` | String | `auto` | Whether to apply the index on query<br/>- `auto`: automatically (default)<br/>- `disable`: never |
| `region_engine.mito.inverted_index.mem_threshold_on_create` | String | `auto` | Memory threshold for performing an external sort during index creation.<br/>- `auto`: automatically determine the threshold based on the system memory size (default)<br/>- `unlimited`: no memory limit<br/>- `[size]` e.g. `64MB`: fixed memory threshold |
| `region_engine.mito.inverted_index.metadata_cache_size` | String | `64MiB` | Cache size for inverted index metadata. |
| `region_engine.mito.inverted_index.content_cache_size` | String | `128MiB` | Cache size for inverted index content. |
| `region_engine.mito.fulltext_index` | -- | -- | The options for full-text index in Mito engine. |
| `region_engine.mito.fulltext_index.create_on_flush` | String | `auto` | Whether to create the index on flush.<br/>- `auto`: automatically (default)<br/>- `disable`: never |
| `region_engine.mito.fulltext_index.create_on_compaction` | String | `auto` | Whether to create the index on compaction.<br/>- `auto`: automatically (default)<br/>- `disable`: never |
| `region_engine.mito.fulltext_index.apply_on_query` | String | `auto` | Whether to apply the index on query<br/>- `auto`: automatically (default)<br/>- `disable`: never |
| `region_engine.mito.fulltext_index.mem_threshold_on_create` | String | `auto` | Memory threshold for index creation.<br/>- `auto`: automatically determine the threshold based on the system memory size (default)<br/>- `unlimited`: no memory limit<br/>- `[size]` e.g. `64MB`: fixed memory threshold |
| `region_engine.mito.memtable.index_max_keys_per_shard` | Integer | `8192` | The max number of keys in one shard.<br/>Only available for `partition_tree` memtable. |
| `region_engine.mito.memtable.data_freeze_threshold` | Integer | `32768` | The max rows of data inside the actively writing buffer in one shard.<br/>Only available for `partition_tree` memtable. |
| `region_engine.mito.memtable.fork_dictionary_bytes` | String | `1GiB` | Max dictionary bytes.<br/>Only available for `partition_tree` memtable. |
| `logging.append_stdout` | Bool | `true` | Whether to append logs to stdout. |
| `logging.tracing_sample_ratio` | -- | -- | The percentage of tracing will be sampled and exported.<br/>Valid range `[0, 1]`, 1 means all traces are sampled, 0 means all traces are not sampled, the default value is 1.<br/>ratio > 1 are treated as 1. Fractions < 0 are treated as 0 |
| `export_metrics` | -- | -- | The datanode can export its metrics and send to Prometheus compatible service (e.g. send to `greptimedb` itself) from remote-write API.<br/>This is only used for `greptimedb` to export its own metrics internally. It's different from prometheus scrape. |
| `export_metrics.write_interval` | String | `30s` | The interval of export metrics. |
| `export_metrics.self_import` | -- | -- | For `standalone` mode, `self_import` is recommended to collect metrics generated by itself.<br/>You must create the database before enabling it. |
| `export_metrics.remote_write.url` | String | `""` | The url the metrics send to. The url example can be: `http://127.0.0.1:4000/v1/prometheus/write?db=greptime_metrics`. |
| `grpc.tls.watch` | Bool | `false` | Watch for Certificate and key file change and auto reload.<br/>For now, gRPC tls config does not support auto reload. |
| `logging.append_stdout` | Bool | `true` | Whether to append logs to stdout. |
| `logging.tracing_sample_ratio` | -- | -- | The percentage of tracing will be sampled and exported.<br/>Valid range `[0, 1]`, 1 means all traces are sampled, 0 means all traces are not sampled, the default value is 1.<br/>ratio > 1 are treated as 1. Fractions < 0 are treated as 0 |
| `export_metrics` | -- | -- | The datanode can export its metrics and send to Prometheus compatible service (e.g. send to `greptimedb` itself) from remote-write API.<br/>This is only used for `greptimedb` to export its own metrics internally. It's different from prometheus scrape. |
| `export_metrics.write_interval` | String | `30s` | The interval of export metrics. |
| `export_metrics.self_import` | -- | -- | For `standalone` mode, `self_import` is recommended to collect metrics generated by itself.<br/>You must create the database before enabling it. |
| `export_metrics.remote_write.url` | String | `""` | The url the metrics send to. The url example can be: `http://127.0.0.1:4000/v1/prometheus/write?db=greptime_metrics`. |
| `tracing` | -- | -- | The tracing options. Only takes effect when compiled with the `tokio-console` feature. |
| `tracing.tokio_console_addr` | String | `None` | The tokio console address. |
### Metasrv
| Key | Type | Default | Descriptions |
| --- | -----| ------- | ----------- |
| `data_home` | String | `/tmp/metasrv/` | The working home directory. |
| `bind_addr` | String | `127.0.0.1:3002` | The bind address of metasrv. |
| `server_addr` | String | `127.0.0.1:3002` | The communication server address for frontend and datanode to connect to metasrv, "127.0.0.1:3002" by default for localhost. |
| `store_addr` | String | `127.0.0.1:2379` | Store server address default to etcd store. |
| `store_key_prefix` | String | `""` | If it's not empty, the metasrv will store all data with this key prefix. |
| `enable_region_failover` | Bool | `false` | Whether to enable region failover.<br/>This feature is only available on GreptimeDB running on cluster mode and<br/>- Using Remote WAL<br/>- Using shared storage (e.g., s3). |
| `backend` | String | `EtcdStore` | The datastore for meta server. |
| `runtime` | -- | -- | The runtime options. |
| `runtime.global_rt_size` | Integer | `8` | The number of threads to execute the runtime for global read operations. |
| `runtime.compact_rt_size` | Integer | `4` | The number of threads to execute the runtime for global write operations. |
| `procedure.max_metadata_value_size` | String | `1500KiB` | Auto split large value.<br/>GreptimeDB procedure uses etcd as the default metadata storage backend.<br/>The maximum size of any etcd request is 1.5 MiB.<br/>1500KiB = 1536KiB (1.5MiB) - 36KiB (reserved size of key).<br/>Comment out `max_metadata_value_size` to disable splitting large values (no limit). |
| `failure_detector` | -- | -- | -- |
| `failure_detector.threshold` | Float | `8.0` | The threshold value used by the failure detector to determine failure conditions. |
| `failure_detector.min_std_deviation` | String | `100ms` | The minimum standard deviation of the heartbeat intervals, used to calculate acceptable variations. |
| `failure_detector.acceptable_heartbeat_pause` | String | `10000ms` | The acceptable pause duration between heartbeats, used to determine if a heartbeat interval is acceptable. |
| `failure_detector.first_heartbeat_estimate` | String | `1000ms` | The initial estimate of the heartbeat interval used by the failure detector. |
| `wal.broker_endpoints` | Array | -- | The broker endpoints of the Kafka cluster. |
| `wal.auto_create_topics` | Bool | `true` | Automatically create topics for WAL.<br/>Set to `true` to automatically create topics for WAL.<br/>Otherwise, use topics named `topic_name_prefix_[0..num_topics)` |
| `wal.num_topics` | Integer | `64` | Number of topics. |
| `wal.topic_name_prefix` | String | `greptimedb_wal_topic` | A Kafka topic is constructed by concatenating `topic_name_prefix` and `topic_id`.<br/>e.g., greptimedb_wal_topic_0, greptimedb_wal_topic_1. |
| `wal.replication_factor` | Integer | `1` | Expected number of replicas of each partition. |
| `wal.create_topic_timeout` | String | `30s` | Above which a topic creation operation will be cancelled. |
| `wal.backoff_init` | String | `500ms` | The initial backoff for kafka clients. |
| `wal.backoff_max` | String | `10s` | The maximum backoff for kafka clients. |
| `wal.backoff_base` | Integer | `2` | Exponential backoff rate, i.e. next backoff = base * current backoff. |
| `wal.backoff_deadline` | String | `5mins` | Stop reconnecting if the total wait time reaches the deadline. If this config is missing, the reconnecting won't terminate. |
| `logging` | -- | -- | The logging options. |
| `logging.dir` | String | `/tmp/greptimedb/logs` | The directory to store the log files. |
| `logging.level` | String | `None` | The log level. Can be `info`/`debug`/`warn`/`error`. |
| `logging.append_stdout` | Bool | `true` | Whether to append logs to stdout. |
| `logging.tracing_sample_ratio` | -- | -- | The percentage of tracing will be sampled and exported.<br/>Valid range `[0, 1]`, 1 means all traces are sampled, 0 means all traces are not sampled, the default value is 1.<br/>ratio > 1 are treated as 1. Fractions < 0 are treated as 0 |
| `export_metrics` | -- | -- | The datanode can export its metrics and send to Prometheus compatible service (e.g. send to `greptimedb` itself) from remote-write API.<br/>This is only used for `greptimedb` to export its own metrics internally. It's different from prometheus scrape. |
| `export_metrics.write_interval` | String | `30s` | The interval of export metrics. |
| `export_metrics.self_import` | -- | -- | For `standalone` mode, `self_import` is recommended to collect metrics generated by itself.<br/>You must create the database before enabling it. |
| `export_metrics.remote_write.url` | String | `""` | The url the metrics send to. The url example can be: `http://127.0.0.1:4000/v1/prometheus/write?db=greptime_metrics`. |
| `tracing` | -- | -- | The tracing options. Only takes effect when compiled with the `tokio-console` feature. |
| `tracing.tokio_console_addr` | String | `None` | The tokio console address. |
### Datanode
| Key | Type | Default | Descriptions |
| --- | -----| ------- | ----------- |
| `mode` | String | `standalone` | The running mode of the datanode. It can be `standalone` or `distributed`. |
| `node_id` | Integer | `None` | The datanode identifier; it should be unique in the cluster. |
| `require_lease_before_startup` | Bool | `false` | Start services after regions have obtained leases.<br/>It will block the datanode start if it can't receive leases in the heartbeat from metasrv. |
| `init_regions_in_background` | Bool | `false` | Initialize all regions in the background during the startup.<br/>By default, it provides services after all regions have been initialized. |
| `grpc.tls.watch` | Bool | `false` | Watch for Certificate and key file change and auto reload.<br/>For now, gRPC tls config does not support auto reload. |
| `runtime` | -- | -- | The runtime options. |
| `runtime.global_rt_size` | Integer | `8` | The number of threads to execute the runtime for global read operations. |
| `runtime.compact_rt_size` | Integer | `4` | The number of threads to execute the runtime for global write operations. |
| `wal.provider` | String | `raft_engine` | The provider of the WAL.<br/>- `raft_engine`: the wal is stored in the local file system by raft-engine.<br/>- `kafka`: it's remote wal that data is stored in Kafka. |
| `wal.dir` | String | `None` | The directory to store the WAL files.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.file_size` | String | `256MB` | The size of the WAL segment file.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.purge_threshold` | String | `4GB` | The threshold of the WAL size to trigger a flush.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.purge_interval` | String | `10m` | The interval to trigger a flush.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.read_batch_size` | Integer | `128` | The read batch size.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.sync_write` | Bool | `false` | Whether to use sync write.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.enable_log_recycle` | Bool | `true` | Whether to reuse logically truncated log files.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.prefill_log_files` | Bool | `false` | Whether to pre-create log files on start up.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.sync_period` | String | `10s` | Duration for fsyncing log files.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.broker_endpoints` | Array | -- | The Kafka broker endpoints.<br/>**It's only used when the provider is `kafka`**. |
| `wal.max_batch_bytes` | String | `1MB` | The max size of a single producer batch.<br/>Warning: Kafka has a default limit of 1MB per message in a topic.<br/>**It's only used when the provider is `kafka`**. |
| `wal.consumer_wait_timeout` | String | `100ms` | The consumer wait timeout.<br/>**It's only used when the provider is `kafka`**. |
| `wal.backoff_init` | String | `500ms` | The initial backoff delay.<br/>**It's only used when the provider is `kafka`**. |
| `wal.backoff_max` | String | `10s` | The maximum backoff delay.<br/>**It's only used when the provider is `kafka`**. |
| `wal.backoff_base` | Integer | `2` | The exponential backoff rate, i.e. next backoff = base * current backoff.<br/>**It's only used when the provider is `kafka`**. |
| `wal.backoff_deadline` | String | `5mins` | The deadline of retries.<br/>**It's only used when the provider is `kafka`**. |
| `wal.create_index` | Bool | `true` | Whether to enable WAL index creation.<br/>**It's only used when the provider is `kafka`**. |
| `wal.dump_index_interval` | String | `60s` | The interval for dumping WAL indexes.<br/>**It's only used when the provider is `kafka`**. |
| `storage` | -- | -- | The data storage options. |
| `storage.data_home` | String | `/tmp/greptimedb/` | The working home directory. |
| `storage.type` | String | `File` | The storage type used to store the data.<br/>- `File`: the data is stored in the local file system.<br/>- `S3`: the data is stored in the S3 object storage.<br/>- `Gcs`: the data is stored in the Google Cloud Storage.<br/>- `Azblob`: the data is stored in the Azure Blob Storage.<br/>- `Oss`: the data is stored in the Aliyun OSS. |
| `storage.cache_path` | String | `None` | Cache configuration for object storage such as 'S3' etc.<br/>The local file cache directory. |
| `storage.cache_capacity` | String | `None` | The local file cache capacity in bytes. |
| `storage.bucket` | String | `None` | The S3 bucket name.<br/>**It's only used when the storage type is `S3`, `Oss` and `Gcs`**. |
| `storage.root` | String | `None` | The S3 data will be stored in the specified prefix, for example, `s3://${bucket}/${root}`.<br/>**It's only used when the storage type is `S3`, `Oss` and `Azblob`**. |
| `storage.access_key_id` | String | `None` | The access key id of the aws account.<br/>It's **highly recommended** to use AWS IAM roles instead of hardcoding the access key id and secret key.<br/>**It's only used when the storage type is `S3` and `Oss`**. |
| `storage.secret_access_key` | String | `None` | The secret access key of the aws account.<br/>It's **highly recommended** to use AWS IAM roles instead of hardcoding the access key id and secret key.<br/>**It's only used when the storage type is `S3`**. |
| `storage.access_key_secret` | String | `None` | The secret access key of the aliyun account.<br/>**It's only used when the storage type is `Oss`**. |
| `storage.account_name` | String | `None` | The account key of the azure account.<br/>**It's only used when the storage type is `Azblob`**. |
| `storage.account_key` | String | `None` | The account key of the azure account.<br/>**It's only used when the storage type is `Azblob`**. |
| `storage.scope` | String | `None` | The scope of the google cloud storage.<br/>**It's only used when the storage type is `Gcs`**. |
| `storage.credential_path` | String | `None` | The credential path of the google cloud storage.<br/>**It's only used when the storage type is `Gcs`**. |
| `storage.credential` | String | `None` | The credential of the google cloud storage.<br/>**It's only used when the storage type is `Gcs`**. |
| `storage.container` | String | `None` | The container of the azure account.<br/>**It's only used when the storage type is `Azblob`**. |
| `storage.sas_token` | String | `None` | The sas token of the azure account.<br/>**It's only used when the storage type is `Azblob`**. |
| `storage.endpoint` | String | `None` | The endpoint of the S3 service.<br/>**It's only used when the storage type is `S3`, `Oss`, `Gcs` and `Azblob`**. |
| `storage.region` | String | `None` | The region of the S3 service.<br/>**It's only used when the storage type is `S3`, `Oss`, `Gcs` and `Azblob`**. |
| `[[region_engine]]` | -- | -- | The region engine options. You can configure multiple region engines. |
| `region_engine.mito.num_workers` | Integer | `8` | Number of region workers. |
| `region_engine.mito.worker_channel_size` | Integer | `128` | Request channel size of each worker. |
| `region_engine.mito.worker_request_batch_size` | Integer | `64` | Max batch size for a worker to handle requests. |
| `region_engine.mito.manifest_checkpoint_distance` | Integer | `10` | Number of meta action updated to trigger a new checkpoint for the manifest. |
| `region_engine.mito.compress_manifest` | Bool | `false` | Whether to compress manifest and checkpoint file by gzip (default false). |
| `region_engine.mito.max_background_jobs` | Integer | `4` | Max number of running background jobs |
| `region_engine.mito.auto_flush_interval` | String | `1h` | Interval to auto flush a region if it has not flushed yet. |
| `region_engine.mito.global_write_buffer_size` | String | `1GB` | Global write buffer size for all regions. If not set, it's default to 1/8 of OS memory with a max limitation of 1GB. |
| `region_engine.mito.global_write_buffer_reject_size` | String | `2GB` | Global write buffer size threshold to reject write requests. If not set, it's default to 2 times of `global_write_buffer_size` |
| `region_engine.mito.sst_meta_cache_size` | String | `128MB` | Cache size for SST metadata. Setting it to 0 to disable the cache.<br/>If not set, it's default to 1/32 of OS memory with a max limitation of 128MB. |
| `region_engine.mito.vector_cache_size` | String | `512MB` | Cache size for vectors and arrow arrays. Setting it to 0 to disable the cache.<br/>If not set, it's default to 1/16 of OS memory with a max limitation of 512MB. |
| `region_engine.mito.page_cache_size` | String | `512MB` | Cache size for pages of SST row groups. Setting it to 0 to disable the cache.<br/>If not set, it's default to 1/8 of OS memory. |
| `region_engine.mito.selector_result_cache_size` | String | `512MB` | Cache size for time series selector (e.g. `last_value()`). Setting it to 0 to disable the cache.<br/>If not set, it's default to 1/16 of OS memory with a max limitation of 512MB. |
| `region_engine.mito.enable_experimental_write_cache` | Bool | `false` | Whether to enable the experimental write cache. |
| `region_engine.mito.experimental_write_cache_path` | String | `""` | File system path for write cache, defaults to `{data_home}/write_cache`. |
| `region_engine.mito.scan_parallelism` | Integer | `0` | Parallelism to scan a region (default: 1/4 of cpu cores).<br/>- `0`: using the default value (1/4 of cpu cores).<br/>- `1`: scan in current thread.<br/>- `n`: scan in parallelism n. |
| `region_engine.mito.parallel_scan_channel_size` | Integer | `32` | Capacity of the channel to send data from parallel scan tasks to the main task. |
| `region_engine.mito.allow_stale_entries` | Bool | `false` | Whether to allow stale WAL entries read during replay. |
| `region_engine.mito.index` | -- | -- | The options for index in Mito engine. |
| `region_engine.mito.index.aux_path` | String | `""` | Auxiliary directory path for the index in filesystem, used to store intermediate files for<br/>creating the index and staging files for searching the index, defaults to `{data_home}/index_intermediate`.<br/>The default name for this directory is `index_intermediate` for backward compatibility.<br/><br/>This path contains two subdirectories:<br/>- `__intm`: for storing intermediate files used during creating index.<br/>- `staging`: for storing staging files used during searching index. |
| `region_engine.mito.index.staging_size` | String | `2GB` | The max capacity of the staging directory. |
| `region_engine.mito.inverted_index` | -- | -- | The options for inverted index in Mito engine. |
| `region_engine.mito.inverted_index.create_on_flush` | String | `auto` | Whether to create the index on flush.<br/>- `auto`: automatically (default)<br/>- `disable`: never |
| `region_engine.mito.inverted_index.create_on_compaction` | String | `auto` | Whether to create the index on compaction.<br/>- `auto`: automatically (default)<br/>- `disable`: never |
| `region_engine.mito.inverted_index.apply_on_query` | String | `auto` | Whether to apply the index on query<br/>- `auto`: automatically (default)<br/>- `disable`: never |
| `region_engine.mito.inverted_index.mem_threshold_on_create` | String | `auto` | Memory threshold for performing an external sort during index creation.<br/>- `auto`: automatically determine the threshold based on the system memory size (default)<br/>- `unlimited`: no memory limit<br/>- `[size]` e.g. `64MB`: fixed memory threshold |
| `region_engine.mito.fulltext_index` | -- | -- | The options for full-text index in Mito engine. |
| `region_engine.mito.fulltext_index.create_on_flush` | String | `auto` | Whether to create the index on flush.<br/>- `auto`: automatically (default)<br/>- `disable`: never |
| `region_engine.mito.fulltext_index.create_on_compaction` | String | `auto` | Whether to create the index on compaction.<br/>- `auto`: automatically (default)<br/>- `disable`: never |
| `region_engine.mito.fulltext_index.apply_on_query` | String | `auto` | Whether to apply the index on query<br/>- `auto`: automatically (default)<br/>- `disable`: never |
| `region_engine.mito.fulltext_index.mem_threshold_on_create` | String | `auto` | Memory threshold for index creation.<br/>- `auto`: automatically determine the threshold based on the system memory size (default)<br/>- `unlimited`: no memory limit<br/>- `[size]` e.g. `64MB`: fixed memory threshold |
| `region_engine.mito.memtable.index_max_keys_per_shard` | Integer | `8192` | The max number of keys in one shard.<br/>Only available for `partition_tree` memtable. |
| `region_engine.mito.memtable.data_freeze_threshold` | Integer | `32768` | The max rows of data inside the actively writing buffer in one shard.<br/>Only available for `partition_tree` memtable. |
| `region_engine.mito.memtable.fork_dictionary_bytes` | String | `1GiB` | Max dictionary bytes.<br/>Only available for `partition_tree` memtable. |
| `logging.append_stdout` | Bool | `true` | Whether to append logs to stdout. |
| `logging.tracing_sample_ratio` | -- | -- | The percentage of tracing will be sampled and exported.<br/>Valid range `[0, 1]`, 1 means all traces are sampled, 0 means all traces are not sampled, the default value is 1.<br/>ratio > 1 are treated as 1. Fractions < 0 are treated as 0 |
| `export_metrics` | -- | -- | The datanode can export its metrics and send to Prometheus compatible service (e.g. send to `greptimedb` itself) from remote-write API.<br/>This is only used for `greptimedb` to export its own metrics internally. It's different from prometheus scrape. |
| `export_metrics.write_interval` | String | `30s` | The interval of export metrics. |
| `export_metrics.self_import` | -- | -- | For `standalone` mode, `self_import` is recommended to collect metrics generated by itself.<br/>You must create the database before enabling it. |
| `export_metrics.remote_write.url` | String | `""` | The url the metrics send to. The url example can be: `http://127.0.0.1:4000/v1/prometheus/write?db=greptime_metrics`. |
| `logging.append_stdout` | Bool | `true` | Whether to append logs to stdout. |
| `logging.tracing_sample_ratio` | -- | -- | The percentage of tracing will be sampled and exported.<br/>Valid range `[0, 1]`, 1 means all traces are sampled, 0 means all traces are not sampled, the default value is 1.<br/>ratio > 1 are treated as 1. Fractions < 0 are treated as 0 |
# tracing exporter endpoint with format `ip:port`, we use gRPC OTLP as the exporter, default endpoint is `localhost:4317`
# otlp_endpoint = "localhost:4317"
# Whether to append logs to stdout. Defaults to true.
# append_stdout = true
# The percentage of tracing will be sampled and exported. Valid range `[0, 1]`, 1 means all traces are sampled, 0 means all traces are not sampled, the default value is 1. ratio > 1 are treated as 1. Fractions < 0 are treated as 0
# [logging.tracing_sample_ratio]
# default_ratio = 0.0
[[region_engine]]
## Enable the file engine.
[region_engine.file]
# The standalone instance exports the metrics generated by itself,
# encoded in the Prometheus remote-write format,
# and sends them to a Prometheus remote-write compatible receiver (e.g. `greptimedb` itself)
# This is only used for `greptimedb` to export its own metrics internally. It's different from prometheus scrape.
# [export_metrics]
# Whether to enable export metrics, default is false
# enable = false
# The interval of export metrics
# write_interval = "30s"
# For `standalone` mode, `self_import` is recommended to collect metrics generated by itself
# [export_metrics.self_import]
# db = "information_schema"
## The logging options.
[logging]
## The directory to store the log files.
dir = "/tmp/greptimedb/logs"
## The log level. Can be `info`/`debug`/`warn`/`error`.
## +toml2docs:none-default
level = "info"
## Enable OTLP tracing.
enable_otlp_tracing = false
## The OTLP tracing endpoint.
otlp_endpoint = "http://localhost:4317"
## Whether to append logs to stdout.
append_stdout = true
## The percentage of tracing will be sampled and exported.
## Valid range `[0, 1]`, 1 means all traces are sampled, 0 means all traces are not sampled, the default value is 1.
## ratio > 1 are treated as 1. Fractions < 0 are treated as 0
[logging.tracing_sample_ratio]
default_ratio = 1.0
## The datanode can export its metrics and send to Prometheus compatible service (e.g. send to `greptimedb` itself) from remote-write API.
## This is only used for `greptimedb` to export its own metrics internally. It's different from prometheus scrape.
[export_metrics]
## Whether to enable export metrics.
enable = false
## The interval of export metrics.
write_interval = "30s"
## For `standalone` mode, `self_import` is recommended to collect metrics generated by itself.
## You must create the database before enabling it.
[export_metrics.self_import]
## +toml2docs:none-default
db = "greptime_metrics"
[export_metrics.remote_write]
## The url the metrics send to. The url example can be: `http://127.0.0.1:4000/v1/prometheus/write?db=greptime_metrics`.
url = ""
## HTTP headers of Prometheus remote-write carry.
headers = {}
## The tracing options. Only takes effect when compiled with the `tokio-console` feature.
The goal is to test string/text support for each database. In real scenarios it means the data source (or log data producers) has separate fields defined, or has already processed the raw input.
__Unstructured model__
The log data is inserted as a long string, and then we build a fulltext index upon these strings. For example, an insert request looks like the following:
```SQL
INSERT INTO test_table (message, timestamp) VALUES ()
```
The goal is to test fuzzy search performance for each database. In real scenarios it means the log is produced by some kind of middleware and inserted directly into the database.
## Creating tables
See [here](./create_table.sql) for GreptimeDB and Clickhouse's create table clause.
The mapping of Elasticsearch is created automatically.
## Vector Configuration
We use vector to generate random log data and send inserts to databases.
Please refer to [structured config](./structured_vector.toml) and [unstructured config](./unstructured_vector.toml) for detailed configuration.
## SQLs and payloads
Please refer to [SQL query](./query.sql) for GreptimeDB and Clickhouse, and [query payload](./query.md) for Elasticsearch.
## Steps to reproduce
0. Decide whether to run the structured model test or the unstructured model test.
1. Build the vector binary (see vector's config file for the specific branch) and the database binaries accordingly.
2. Create table in GreptimeDB and Clickhouse in advance.
3. Run vector to insert data.
4. When data insertion is finished, run queries against each database. Note: you'll need to update the time range values after data insertion.
## Addition
- You can tune GreptimeDB's configuration to get better performance.
- You can setup GreptimeDB to use S3 as storage, see [here](https://docs.greptime.com/user-guide/operations/configuration/#storage-options).
Refer to our [document](https://docs.greptime.com/getting-started/installation/overview) for how to install and start GreptimeDB, or check this [document](https://docs.greptime.com/contributor-guide/getting-started#compile-and-run) for how to build GreptimeDB from source.
## Write Data
After the DB is started, we can use `tsbs_load_greptime` to test the write performance.
```shell
./bin/tsbs_load_greptime \
--urls=http://localhost:4000 \
--file=./bench-data/influx-data.lp \
--batch-size=3000 \
--gzip=false \
--workers=6
```
Parameters here are only provided as an example. You can choose whatever you like or adjust them to match your target scenario.
Notice that if you want to rerun `tsbs_load_greptime`, please destroy and restart the DB and clear its previous data first. Existing duplicated data will impact the write and query performance.
## Query Data
After the data is imported, you can then run queries. The following script runs all queries. You can also choose a subset of queries to run.
This document introduces how to write fuzz tests in GreptimeDB.
## What is a fuzz test
A fuzz test is a tool that leverages deterministic random generation to assist in finding bugs. The goal of fuzz tests is to identify inputs generated by the fuzzer that cause system panics, crashes, or unexpected behaviors. We use [cargo-fuzz](https://github.com/rust-fuzz/cargo-fuzz) to run our fuzz test targets.
## Why we need them
- Find bugs by leveraging random generation
- Integrate with other tests (e.g., e2e)
## Resources
All fuzz test-related resources are located in the `/tests-fuzz` directory.
There are two types of resources: (1) fundamental components and (2) test targets.
### Fundamental components
They are located in the `/tests-fuzz/src` directory. The fundamental components define how to generate SQLs (including dialects for different protocols) and validate execution results (e.g., column attribute validation), etc.
### Test targets
They are located in the `/tests-fuzz/targets` directory, with each file representing an independent fuzz test case. The target utilizes fundamental components to generate SQLs, sends the generated SQLs via specified protocol, and validates the results of SQL execution.
Figure 1 illustrates how the fundamental components of the fuzz test provide the ability to generate random SQL. They utilize a Random Number Generator (Rng) to generate the Intermediate Representation (IR), then employ a DialectTranslator to produce specified dialects for different protocols. Finally, the fuzz tests send the generated SQL via the specified protocol and verify that the execution results meet expectations.
For more details about fuzz targets and fundamental components, please refer to this [tracking issue](https://github.com/GreptimeTeam/greptimedb/issues/3174).
## How to add a fuzz test target
1. Create an empty Rust source file at `/tests-fuzz/targets/<fuzz-target>.rs`.
2. Register the fuzz test target in the `/tests-fuzz/Cargo.toml` file.
```toml
[[bin]]
name="<fuzz-target>"
path="targets/<fuzz-target>.rs"
test=false
bench=false
doc=false
```
3. Define the `FuzzInput` in `/tests-fuzz/targets/<fuzz-target>.rs`, as sketched below.
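For orientation only, a minimal target skeleton might look like the following. The `FuzzInput` fields and the body are hypothetical placeholders; in real targets they come from the fundamental components under `/tests-fuzz/src`.
```rust
// targets/<fuzz-target>.rs -- minimal sketch, field names are hypothetical.
#![no_main]

use arbitrary::Arbitrary;
use libfuzzer_sys::fuzz_target;

/// Deterministic seed material decoded from the fuzzer's byte stream.
#[derive(Debug, Arbitrary)]
struct FuzzInput {
    seed: u64,
    tables: u8,
}

fuzz_target!(|input: FuzzInput| {
    // Typically: build an Rng from `input.seed`, generate the IR and the SQL
    // dialect, execute it over the chosen protocol, then validate the result.
    let _ = input;
});
```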
This RFC proposes to add a new expression node `MergeScan` to merge results from remote data sources (see the Frontend and Remote-Sources diagram in the RFC).
This merge operation simply chains all the underlying remote data sources and returns `RecordBatch`es, just like a coalesce op. Each remote source is a gRPC query to a datanode via the substrait logical plan interface. The plan is transformed and divided from the original query that arrives at the frontend.
## Deploy mode and protocol
- Greptime Flow is an independent streaming compute component. It can be used either within a standalone node or as a dedicated node at the same level as frontend in distributed mode.
- It accepts insert requests as `Rows`, the format used between frontend and datanode.
- A new flow job is submitted in the form of a modified SQL query, as Snowflake does, e.g.: `CREATE TASK avg_over_5m WINDOW_SIZE = "5m" AS SELECT avg(value) FROM table WHERE time > now() - 5m GROUP BY time(1m)`. The flow job is then stored in Metasrv.
- It also persists results in the format of Rows to frontend.
- The query plan uses Substrait as codec format. It's the same with GreptimeDB's query engine.
- Greptime Flow needs a WAL for recovering. It's possible to reuse datanode's.
The `datatypes` crate defines the elementary schema structs to describe the metadata.
## ColumnSchema
[ColumnSchema](https://github.com/GreptimeTeam/greptimedb/blob/9fa871a3fad07f583dc1863a509414da393747f8/src/datatypes/src/schema/column_schema.rs#L36) represents the metadata of a column. It is equivalent to arrow's [Field](https://docs.rs/arrow/latest/arrow/datatypes/struct.Field.html) with additional metadata such as default constraint and whether the column is a time index. The time index is the column with a `TIME INDEX` constraint of a table. We can convert the `ColumnSchema` into an arrow `Field` and convert the `Field` back to the `ColumnSchema` without losing metadata.
[Schema](https://github.com/GreptimeTeam/greptimedb/blob/9fa871a3fad07f583dc1863a509414da393747f8/src/datatypes/src/schema.rs#L38) is an ordered sequence of `ColumnSchema`. It is equivalent to arrow's [Schema](https://docs.rs/arrow/latest/arrow/datatypes/struct.Schema.html) with additional metadata including the index of the time index column and the version of this schema. Same as `ColumnSchema`, we can convert our `Schema` from/to arrow's `Schema`.
```rust
use arrow::datatypes::Schema as ArrowSchema;

pub struct Schema {
    column_schemas: Vec<ColumnSchema>,
    name_to_index: HashMap<String, usize>,
    arrow_schema: Arc<ArrowSchema>,
    timestamp_index: Option<usize>,
    version: u32,
}

pub type SchemaRef = Arc<Schema>;
```
We alias `Arc<Schema>` as `SchemaRef` since it is used frequently. Mostly, we use our `ColumnSchema` and `Schema` structs instead of Arrow's `Field` and `Schema` unless we need to invoke third-party libraries (like DataFusion or ArrowFlight) that rely on Arrow.
## RawSchema
`Schema` contains fields like a map from column names to their indices in the `ColumnSchema` sequences and a cached arrow `Schema`. We can construct these fields from the `ColumnSchema` sequences thus we don't want to serialize them. This is why we don't derive `Serialize` and `Deserialize` for `Schema`. We introduce a new struct [RawSchema](https://github.com/GreptimeTeam/greptimedb/blob/9fa871a3fad07f583dc1863a509414da393747f8/src/datatypes/src/schema/raw.rs#L24) which keeps all required fields of a `Schema` and derives the serialization traits. To serialize a `Schema`, we need to convert it into a `RawSchema` first and serialize the `RawSchema`.
```rust
pub struct RawSchema {
    pub column_schemas: Vec<ColumnSchema>,
    pub timestamp_index: Option<usize>,
    pub version: u32,
}
```
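To make the round trip concrete, here is a simplified sketch using stand-in types (not the actual `datatypes` API): the derived fields of `Schema` are rebuilt from a deserialized `RawSchema` instead of being serialized themselves.
```rust
use serde::{Deserialize, Serialize};

// Stand-in column type; the real ColumnSchema also carries a data type, nullability, etc.
#[derive(Clone, Serialize, Deserialize)]
struct ColumnSchema {
    name: String,
}

#[derive(Serialize, Deserialize)]
struct RawSchema {
    column_schemas: Vec<ColumnSchema>,
    timestamp_index: Option<usize>,
    version: u32,
}

struct Schema {
    column_schemas: Vec<ColumnSchema>,
    // Derived field: rebuilt on load instead of serialized (like the cached arrow schema).
    name_to_index: std::collections::HashMap<String, usize>,
    timestamp_index: Option<usize>,
    version: u32,
}

impl From<RawSchema> for Schema {
    fn from(raw: RawSchema) -> Self {
        let name_to_index = raw
            .column_schemas
            .iter()
            .enumerate()
            .map(|(i, c)| (c.name.clone(), i))
            .collect();
        Schema {
            column_schemas: raw.column_schemas,
            name_to_index,
            timestamp_index: raw.timestamp_index,
            version: raw.version,
        }
    }
}

fn main() {
    let raw = RawSchema {
        column_schemas: vec![ColumnSchema { name: "ts".into() }],
        timestamp_index: Some(0),
        version: 0,
    };
    let schema = Schema::from(raw);
    assert_eq!(schema.name_to_index["ts"], 0);
}
```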
We want to keep the `Schema` simple and avoid putting too much business-related metadata in it as many different structs or traits rely on it.
# Schema of the Table
A table maintains its schema in [TableMeta](https://github.com/GreptimeTeam/greptimedb/blob/9fa871a3fad07f583dc1863a509414da393747f8/src/table/src/metadata.rs#L97).
```rust
pub struct TableMeta {
    pub schema: SchemaRef,
    pub primary_key_indices: Vec<usize>,
    pub value_indices: Vec<usize>,
    // ...
}
```
The order of columns in `TableMeta::schema` is the same as the order specified in the `CREATE TABLE` statement which users use to create this table.
The field `primary_key_indices` stores indices of primary key columns. The field `value_indices` records the indices of value columns (non-primary key and time index, we sometimes call them field columns).
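As a small, purely hypothetical example (the `monitor` table below is not from this document), the two index lists for a table created as `CREATE TABLE monitor (host STRING, ts TIMESTAMP TIME INDEX, cpu_util DOUBLE, mem_util DOUBLE, PRIMARY KEY(host))` would be:
```rust
fn main() {
    // Columns in CREATE TABLE order: [host, ts, cpu_util, mem_util]
    // `host` is the primary key, `ts` is the time index.
    let primary_key_indices = vec![0]; // host
    let value_indices = vec![2, 3];    // cpu_util, mem_util (field columns)
    println!("{primary_key_indices:?} {value_indices:?}");
}
```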
We split a table into one or more units with the same schema and then store these units in the storage engine. Each unit is a region in the storage engine.
The storage engine maintains schemas of regions in more complicated ways because it
- adds internal columns that are invisible to users to store additional metadata for each row
- provides a data model similar to the key-value model so it organizes columns in a different order
- maintains additional metadata like column id or column family
So the storage engine defines several schema structs:
- RegionSchema
- StoreSchema
- ProjectedSchema
## RegionSchema
A [RegionSchema](https://github.com/GreptimeTeam/greptimedb/blob/9fa871a3fad07f583dc1863a509414da393747f8/src/storage/src/schema/region.rs#L37) describes the schema of a region.
```rust
pub struct RegionSchema {
    user_schema: SchemaRef,
    store_schema: StoreSchemaRef,
    columns: ColumnsMetadataRef,
}
```
Each region reserves some columns called `internal columns` for internal usage:
- `__sequence`, sequence number of a row
- `__op_type`, operation type of a row, such as `PUT` or `DELETE`
- `__version`, user-specified version of a row, reserved but not used. We might remove this in the future.
The table engine can't see the `__sequence` and `__op_type` columns, so the `RegionSchema` itself maintains two internal schemas:
- User schema, a `Schema` struct that doesn't have internal columns
- Store schema, a `StoreSchema` struct that has internal columns
The `ColumnsMetadata` struct keeps metadata about all columns, but most of the time we only need the metadata in the user schema and store schema, so we just ignore it. We may remove this struct in the future.
`RegionSchema` organizes columns in the following order:
```
key columns, timestamp, [__version,] value columns, __sequence, __op_type
```
We can ignore the `__version` column because it is disabled now:
```
key columns, timestamp, value columns, __sequence, __op_type
```
Key columns are columns of a table's primary key. Timestamp is the time index column. A region sorts all rows by key columns, timestamp, sequence, and op type.
So the `RegionSchema` of our `cpu` table above looks like this:
```json
{
"user_schema":[
"datacenter",
"host",
"ts",
"usage_user",
"usage_system"
],
"store_schema":[
"datacenter",
"host",
"ts",
"usage_user",
"usage_system",
"__sequence",
"__op_type"
]
}
```
## StoreSchema
As described above, a [StoreSchema](https://github.com/GreptimeTeam/greptimedb/blob/9fa871a3fad07f583dc1863a509414da393747f8/src/storage/src/schema/store.rs#L36) is a schema that knows all internal columns.
```rust
struct StoreSchema {
    columns: Vec<ColumnMetadata>,
    schema: SchemaRef,
    row_key_end: usize,
    user_column_end: usize,
}
```
The columns in the `columns` and `schema` fields have the same order. The `ColumnMetadata` has metadata like column id, column family id, and comment. The `StoreSchema` also stores this metadata in `StoreSchema::schema`, so we can convert between the `StoreSchema` and arrow's `Schema`. We use this feature to persist the `StoreSchema` in the SST since our SST format is `Parquet`, which can take arrow's `Schema` as its schema.
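As a rough sketch of that idea (the metadata keys below are illustrative, not the engine's actual ones), extra information can ride along in the arrow `Schema`'s key/value metadata, which Parquet writers persist together with the file:
```rust
use std::collections::HashMap;

use arrow::datatypes::{DataType, Field, Schema as ArrowSchema, TimeUnit};

fn store_schema_as_arrow() -> ArrowSchema {
    let fields = vec![
        Field::new("host", DataType::Utf8, false),
        Field::new("ts", DataType::Timestamp(TimeUnit::Millisecond, None), false),
        Field::new("usage_user", DataType::Float64, true),
    ];
    // Engine-specific metadata (hypothetical keys) carried by the arrow schema.
    let mut metadata = HashMap::new();
    metadata.insert("schema_version".to_string(), "0".to_string());
    metadata.insert("row_key_end".to_string(), "2".to_string());
    ArrowSchema::new_with_metadata(fields, metadata)
}

fn main() {
    println!("{:?}", store_schema_as_arrow());
}
```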
The `StoreSchema` of the region above is similar to this:
```json
{
"schema":{
"column_schemas":[
"datacenter",
"host",
"ts",
"usage_user",
"usage_system",
"__sequence",
"__op_type"
],
"time_index":2,
"version":0
},
"row_key_end":3,
"user_column_end":5
}
```
The key and timestamp columns form row keys of rows. We put them together so we can use `row_key_end` to get indices of all row key columns. Similarly, we can use the `user_column_end` to get indices of all user columns (non-internal columns).
Another useful feature of `StoreSchema` is that we ensure it always contains key columns, a timestamp column, and internal columns because we need them to perform merge, deduplication, and delete. Projection on `StoreSchema` only projects value columns.
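In other words, the two end indices partition the column list. A small sketch using the `StoreSchema` shown above:
```rust
fn main() {
    // Column layout of the StoreSchema shown above.
    let columns = [
        "datacenter", "host", "ts", "usage_user", "usage_system", "__sequence", "__op_type",
    ];
    let row_key_end = 3;
    let user_column_end = 5;

    let row_key_columns = &columns[..row_key_end];              // datacenter, host, ts
    let value_columns = &columns[row_key_end..user_column_end]; // usage_user, usage_system
    let internal_columns = &columns[user_column_end..];         // __sequence, __op_type
    println!("{row_key_columns:?} {value_columns:?} {internal_columns:?}");
}
```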
## ProjectedSchema
To support arbitrary projection, we introduce the [ProjectedSchema](https://github.com/GreptimeTeam/greptimedb/blob/9fa871a3fad07f583dc1863a509414da393747f8/src/storage/src/schema/projected.rs#L106).
```rust
pub struct ProjectedSchema {
    projection: Option<Projection>,
    schema_to_read: StoreSchemaRef,
    projected_user_schema: SchemaRef,
}
```
We need to handle many cases while doing projection:
- The columns' order of table and region is different
- The projection can be in arbitrary order, e.g. `select usage_user, host from cpu` and `select host, usage_user from cpu` have different projection order
- We support `ALTER TABLE` so data files may have different schemas.
### Projection
Let's take an example to see how projection works. Suppose we want to select `ts`, `usage_system` from the `cpu` table.
The query engine uses the projection `[0, 3]` to scan the table. However, columns in the region have a different order, so the table engine adjusts the projection to `[2, 4]`.
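A sketch of that index adjustment, remapping by column name; the helper and the table column order below are illustrative assumptions, not the engine's actual code:
```rust
/// Map projection indices defined against the table's column order to the
/// region's column order by looking columns up by name.
fn remap_projection(table_cols: &[&str], region_cols: &[&str], projection: &[usize]) -> Vec<usize> {
    projection
        .iter()
        .map(|&i| {
            let name = table_cols[i];
            region_cols
                .iter()
                .position(|c| *c == name)
                .expect("projected column must exist in the region")
        })
        .collect()
}

fn main() {
    // Region order as shown below; the table order here is a hypothetical example.
    let region_cols = ["datacenter", "host", "ts", "usage_user", "usage_system"];
    let table_cols = ["ts", "host", "usage_user", "usage_system", "datacenter"];
    assert_eq!(remap_projection(&table_cols, &region_cols, &[0, 3]), vec![2, 4]);
}
```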
```json
{
"user_schema":[
"datacenter",
"host",
"ts",
"usage_user",
"usage_system"
],
}
```
As you can see, the output order is still `[ts, usage_system]`. This is the schema users can see after projection so we call it `projected user schema`.
But the storage engine also needs to read key columns, a timestamp column, and internal columns. So we maintain a `StoreSchema` after projection in the `ProjectedSchema`.
The `Projection` struct is a helper struct to help compute the projected user schema and store schema.
So we can construct the following `ProjectedSchema`:
```json
{
"schema_to_read":{
"schema":{
"column_schemas":[
"datacenter",
"host",
"ts",
"usage_system",
"__sequence",
"__op_type"
],
"time_index":2,
"version":0
},
"row_key_end":3,
"user_column_end":4
},
"projected_user_schema":{
"column_schemas":[
"ts",
"usage_system"
],
"time_index":0
}
}
```
As you can see, `schema_to_read` doesn't contain the column `usage_user` that is not intended to be read (not in projection).
### ReadAdapter
As mentioned above, we can alter a table so the underlying files (SSTs) and memtables in the storage engine may have different schemas.
To simplify the logic of `ProjectedSchema`, we handle the difference between schemas before projection (constructing the `ProjectedSchema`). We introduce [ReadAdapter](https://github.com/GreptimeTeam/greptimedb/blob/9fa871a3fad07f583dc1863a509414da393747f8/src/storage/src/schema/compat.rs#L90) that adapts rows with different source schemas to the same expected schema.
So we can always use the current `RegionSchema` of the region to construct the `ProjectedSchema`, and then create a `ReadAdapter` for each memtable or SST.
```rust
#[derive(Debug)]
pub struct ReadAdapter {
    source_schema: StoreSchemaRef,
    dest_schema: ProjectedSchemaRef,
    indices_in_result: Vec<Option<usize>>,
    is_source_needed: Vec<bool>,
}
```
For each column required by `dest_schema`, `indices_in_result` stores the index of that column in the row read from the source memtable or SST. If the source row doesn't contain that column, the index is `None`.
The field `is_source_needed` stores whether a column in the source memtable or SST is needed.
Suppose we add a new column `usage_idle` to the table `cpu`.
```sql
ALTER TABLE cpu ADD COLUMN usage_idle DOUBLE;
```
The new `StoreSchema` becomes:
```json
{
"schema":{
"column_schemas":[
"datacenter",
"host",
"ts",
"usage_user",
"usage_system",
"usage_idle",
"__sequence",
"__op_type"
],
"time_index":2,
"version":1
},
"row_key_end":3,
"user_column_end":6
}
```
Note that we bump the version of the schema to 1.
Suppose we want to select `ts`, `usage_system`, and `usage_idle`. While reading from the old schema, the storage engine creates a `ReadAdapter` like this:
```json
{
"source_schema":{
"schema":{
"column_schemas":[
"datacenter",
"host",
"ts",
"usage_user",
"usage_system",
"__sequence",
"__op_type"
],
"time_index":2,
"version":0
},
"row_key_end":3,
"user_column_end":5
},
"dest_schema":{
"schema_to_read":{
"schema":{
"column_schemas":[
"datacenter",
"host",
"ts",
"usage_system",
"usage_idle",
"__sequence",
"__op_type"
],
"time_index":2,
"version":1
},
"row_key_end":3,
"user_column_end":5
},
"projected_user_schema":{
"column_schemas":[
"ts",
"usage_system",
"usage_idle"
],
"time_index":0
}
},
"indices_in_result":[
0,
1,
2,
3,
null,
4,
5
],
"is_source_needed":[
true,
true,
true,
false,
true,
true,
true
]
}
```
We don't need to read `usage_user` so `is_source_needed[3]` is false. The old schema doesn't have column `usage_idle` so `indices_in_result[4]` is `null` and the `ReadAdapter` needs to insert a null column to the output row so the output schema still contains `usage_idle`.
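To make the mapping concrete, here is a simplified sketch of how such an adapter could assemble an output row. It assumes the source batch has already been pruned with `is_source_needed`; this is illustrative, not the actual implementation:
```rust
/// Assemble the row expected by the destination schema from a source row.
/// `indices_in_result[i]` is the position of destination column `i` in the
/// (already pruned) source row, or `None` if the source row lacks that column.
fn adapt_row<T: Clone>(source_row: &[Option<T>], indices_in_result: &[Option<usize>]) -> Vec<Option<T>> {
    indices_in_result
        .iter()
        .map(|idx| match idx {
            Some(i) => source_row[*i].clone(),
            None => None, // column added by ALTER TABLE after this file was written
        })
        .collect()
}

fn main() {
    // Mirrors the example above: the pruned source row has 6 columns and the
    // destination expects 7, with `usage_idle` (index 4) filled with null.
    let source_row: Vec<Option<i64>> = vec![Some(1), Some(2), Some(3), Some(4), Some(5), Some(6)];
    let indices = [Some(0), Some(1), Some(2), Some(3), None, Some(4), Some(5)];
    assert_eq!(adapt_row(&source_row, &indices)[4], None);
}
```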
The figure below shows the relationship between `RegionSchema`, `StoreSchema`, `ProjectedSchema`, and `ReadAdapter`.
```text
┌──────────────────────────────┐
│ │
│ ┌────────────────────┐ │
│ │ store_schema │ │
│ │ │ │
│ │ StoreSchema │ │
│ │ version 1 │ │
│ └────────────────────┘ │
│ │
│ ┌────────────────────┐ │
│ │ user_schema │ │
│ └────────────────────┘ │
│ │
│ RegionSchema │
│ │
└──────────────┬───────────────┘
│
│
│
┌──────────────▼───────────────┐
│ │
│ ┌──────────────────────────┐ │
│ │ schema_to_read │ │
│ │ │ │
│ │ StoreSchema (projected) │ │
│ │ version 1 │ │
│ └──────────────────────────┘ │
┌───┤ ├───┐
│ │ ┌──────────────────────────┐ │ │
│ │ │ projected_user_schema │ │ │
│ │ └──────────────────────────┘ │ │
│ │ │ │
│ │ ProjectedSchema │ │
 dest schema │ └──────────────────────────────┘ │ dest schema
```