* feat: support invalidate schema name key cache
* fix: remove pub for invalidate_schema_cache
* refactor: add DropMetadataBroadcast State Op
* fix: delete files
* feat: transform substrait SELECT&WHERE&GROUP BY to Flow Plan
* chore: reexport from common/substrait
* feat: use datafusion Aggr Func to map to Flow aggr func
* chore: remove unwrap&split literal
* refactor: split transform.rs into smaller files
* feat: apply optimization for variadic fn
* refactor: split unit test
* chore: per review
* fix: cli export `create table` with quoted names
* add test
* apply review comments
* fix to pass check
* remove eprintln for clippy check
* use prebuilt binary to avoid compile
* ci run coverage after build
* drop dirty hack test
Signed-off-by: tison <wander4096@gmail.com>
---------
Signed-off-by: tison <wander4096@gmail.com>
Co-authored-by: tison <wander4096@gmail.com>
* refactor: func's specialization& use Error not EvalError
* docs: some pub item
* chore: typo
* docs: add comments for every pub item
* chore: per review
* chore: per review&derive Copy
* chore: per review&test for binary fn spec
* docs: comment explain how binary func spec works
* chore: minor style change
* fix: Error not EvalError
* chore: keep the same method order in KvBackend
* feat: make meta client can get all node info of cluster
* feat: cluster info data model
* feat: frontend and datanode info
* feat: list node info
* chore: remove the method: is_started
* fix: scan key prefix
* chore: impl From for NodeInfoKey
* chore: doc for trait and struct
* chore: reuse the error
* chore: refactor two collect cluster info handlers
* chore: remove inline
* chore: refactor two collect cluster info handlers
* fix: move object store read/write timer into inner
* add Drop for PrometheusMetricWrapper
* call await on async read/write
* apply review comments
* get rid of option on timer
* test: add integration_test for datetime style
* feat: support various datestyle for postgres
* doc: rewrite the comment about merge_datestyle_value
* test: add more test to illustrate valid datestyle input
* feat: add __schema__ tag for promql parser
* feat: disable matcher op other than equals
* test: add more test to ensure context getting reset
* test: add integration test
* test: refactor tests
* refactor: remove duplicated test code
* refactor: update according to review comments
* test: add sqlness test for cross schema scenario
---------
Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
* feat(tql): add initial support for start,stop,step as sql functions
* fix(tql): remove unwraps, adjust fmt
* fix(tql): address taplo issue
* feat(tql): update parse_tql_query logic
* fix(tql): change query parsing logic to use parser instead of delimiter
* fix(tql): add timestamp function support, add sqlness tests
* fix(tql): add lookback optional param for tql eval
* fix(tql): adjust tests for now() function
* fix(tql): introduce the tqlerror to differentiate failures on parsing, evaluation and simplification stages
* fix(tql): add tests for explain/analyze
* feat(tql): add lookback support for explain/analyze, update tests
* feat(tql): add more sqlness tests
* chore(tql): extract common logic for eval, analyze and explain into a single function
* feat(tql): address CR points
* feat(tql): use snafu for tql errors, add more docs
* feat(tql): address CR points
* test: add integration tests for kafka wal
* chore: rebase main
* chore: unify naming convention for wal config
* chore: add register loaders switch
* chore: alter tables by adding a new column
* chore: move rand to dev-dependencies
* chore: update Cargo.lock
* feat: support set variable statement of session
* feat: support printing postgresql's bytea data type in its "hex" and "escape" format in ugly way
* refactor: add 'SessionConfigValue' type and unify the name
* doc: add license header
* refactor: confine coupling with 'sql::ast::Value' in SessionConfigValue
* refactor: move all bytea wrapper into bytea.rs
* fix: remove unused import in context.rs and postgres.rs
* refactor: rename 'set_configuration_parameter' to 'set_session_config'
rename 'set_configuration_parameter' in statement_.rs to 'set_session_config'
* refactor: use mod to organize options via macro
* refactor: re-model the session config value with static type
* test: add integration test
* refactor: move the encode bytea by format type logic into encoder
refactor: use Arc<DashMap> instead of DashMap in QueryContext
Avoid expensive clone
refactor: use unreachable!() instead of unimplemented!()
refactor: move the encode bytea by format type logic into encoder
test: add binary format integration test case
* test: add ut for byte related type
* doc: remove TODO of bytea_output
* refactor: simplify the implementation with simple struct instead of complex typing
* fix: typo of 'Available'
* fix compile
Signed-off-by: tison <wander4096@gmail.com>
---------
Signed-off-by: tison <wander4096@gmail.com>
Co-authored-by: tison <wander4096@gmail.com>
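The commits above add "hex" and "escape" output formats for PostgreSQL's `bytea` type. As a rough illustration of what those two encodings produce, here is a standalone sketch; the function names are hypothetical and this is not the project's actual encoder:

```rust
/// Hypothetical sketch of PostgreSQL-style bytea output formatting.
/// `hex` format: "\x" followed by two lowercase hex digits per byte.
/// `escape` format: printable ASCII as-is, backslash doubled, other bytes as octal escapes.
fn bytea_hex(bytes: &[u8]) -> String {
    let mut out = String::with_capacity(2 + bytes.len() * 2);
    out.push_str("\\x");
    for b in bytes {
        out.push_str(&format!("{:02x}", b));
    }
    out
}

fn bytea_escape(bytes: &[u8]) -> String {
    let mut out = String::new();
    for &b in bytes {
        match b {
            b'\\' => out.push_str("\\\\"),
            0x20..=0x7e => out.push(b as char),
            _ => out.push_str(&format!("\\{:03o}", b)),
        }
    }
    out
}

fn main() {
    let data = b"ab\x00\\c";
    assert_eq!(bytea_hex(data), "\\x6162005c63");
    assert_eq!(bytea_escape(data), "ab\\000\\\\c");
    println!("hex: {}", bytea_hex(data));
    println!("escape: {}", bytea_escape(data));
}
```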
* feat: Able to pretty print sql query result in http output
* fix: add some tests
* fix: add some space, delete fn into_payload, and impl Display for TableResponse
* feat: Arrangement shared state
* feat: arrange&tests
* docs: detailed&tests for get
* chore: license
* refactor: opt out ts expr&tests: internal ts
* docs: remove some TODOs
* feat: use smallvec size of 2
* refactor: per review
* chore: per review
* chore: per review
* chore: remove redundant clone
* feat: return max expire time & docs: more explanation of current expire config
* feat: add memtable builder to region
* refactor: rename memtable_builder in worker to default_memtable_builder
* fix: return error instead of using default compaction options
Support deserializing memtable and compaction options from the option map
* feat: optional memtable options
* feat: add MemtableBuilderProvider to create builders
* feat: change default memtable and skip deserializing dedup
* chore: update test and comment
* chore: test invalid type
* feat: metric engine use new memtable manually
* feat: expose more memtable configs
* feat: add memtable options to valid option list
* test: add test
* test: sqlness test
* chore: serde workspace
* chore: remove comments
* feat: handle flush periodically
* chore: call periodical method in loop
* feat: check periodical tasks on channel timeout
* refactor: use time provider to get time
Mock a time provider to test auto flush
* chore: fix typos
* refactor: rename mock time provider
* style: fix clippy
* chore: address comment
* feat: acquire catalog and schema lock in region failover
* chore: remove unused code
* feat!: acquire catalog and schema lock in region migration
* feat: acquire catalog and schema lock in create table
* feat: call freeze if the active data buffer in a shard is full
* chore: more metrics
* chore: print metrics
* chore: enlarge freeze threshold
* test: test freeze
* test: fix config test
* feat(influxdb): add db query param support for v2 write api
* fix(influxdb): update authorize logic to get catalog and schema from query string
* fix(influxdb): address CR suggestions
* fix(influxdb): use the correct import
* feat: add create table fuzz test
* chore: add ci cfg for fuzz tests
* refactor: remove redundant nightly config
* chore: run fuzz test in debug mode
* chore: use ubuntu-latest
* fix: close connection
* chore: add cache in fuzz test ci
* chore: apply suggestion from CR
* chore: apply suggestion from CR
* chore: refactor the fuzz test action
* feat: add configuration for tls watch option
* test: sleep longer to ensure async task run
* test: update config api integration test
* refactor: rename function
* feat: Support automatic DNS lookup for kafka bootstrap servers
* Revert "feat: Support automatic DNS lookup for kafka bootstrap servers"
This reverts commit 5baed7b01d.
* feat: Support automatic DNS lookup for Kafka broker
* fix: resolve broker endpoint in client manager
* fix: apply clippy lints
* refactor: simplify the code with clippy hint
* refactor: move resolve_broker_endpoint to common/wal/src/lib.rs
* test: add mock test for resolver_broker_endpoint
* refactor: accept niebayes's advice
* refactor: rename EndpointIpNotFound to EndpointIPV4NotFound
* refactor: remove mock test and simplify the implementation
* docs: add comments about test_vallid_host_ipv6
* Apply suggestions from code review
Co-authored-by: niebayes <niebayes@gmail.com>
* move more common code
Signed-off-by: tison <wander4096@gmail.com>
---------
Signed-off-by: tison <wander4096@gmail.com>
Co-authored-by: tison <wander4096@gmail.com>
Co-authored-by: niebayes <niebayes@gmail.com>
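The "automatic DNS lookup for Kafka broker" commits above resolve a configured `host:port` broker endpoint to an IP address before building the Kafka client, preferring IPv4 (hence the `EndpointIPV4NotFound` error). Below is a minimal sketch of that resolution using the standard library resolver; the helper name and error handling are illustrative assumptions, not the project's actual code:

```rust
use std::net::{SocketAddr, ToSocketAddrs};

/// Hypothetical helper: resolve a "host:port" broker endpoint to an IPv4 socket address.
fn resolve_broker_endpoint(endpoint: &str) -> std::io::Result<SocketAddr> {
    let mut addrs = endpoint.to_socket_addrs()?; // blocking DNS lookup
    addrs.find(|addr| addr.is_ipv4()).ok_or_else(|| {
        std::io::Error::new(
            std::io::ErrorKind::NotFound,
            format!("no IPv4 address found for {endpoint}"),
        )
    })
}

fn main() -> std::io::Result<()> {
    // An already-numeric endpoint resolves to itself.
    let addr = resolve_broker_endpoint("127.0.0.1:9092")?;
    println!("resolved to {addr}");
    Ok(())
}
```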
* feat: partition level map
* test: test shard and builder
* fix: do not use pk index from shard builder
* feat: add multi key test
* fix: freeze shard before finding pk in shards
* fix: dict builder resets num_keys on finish
* feat: skip empty shard and builder
* feat: avoid pruning if possible
Implementations:
- Apply all filters on the partition column
- If no filter to prune, skip decoding keys
* refactor: change the receivers of Shard::read/DataBuffer::read/DataParts::read to &self instead of &mut self
* refactor: remove allow(dead_code) in merge tree
* fix: KeyValues num_fields() is incorrect
* chore: fix warnings
* feat: support dedup
* feat: allow using the new memtable
* feat: serde default for config
* fix: resets pk index after finishing a dict
* feat: add fork method to the memtable
* feat: allow mark immutable returns result
* feat: use fork to create the mutable memtable
* feat: remove memtable builder from freeze
* chore: fix warnings
* fix: inspect error
* feat: iter returns result
* chore: maintains memtable id in region
* chore: update comment
* fix: remove region status if failed to freeze a memtable
* chore: update comment
* chore: iter should not require sync
* chore: implement freeze and fork for the new memtable
* refactor: data reader returns reference to data batch
* refactor: use range to create merger
* chore: Reference RecordBatch in DataBatch
* fix: top node not read if no next node
* refactor: move timestamp_array_to_i64_slice to data mod
* style: fix clippy
* chore: derive copy for DataBatch
* chore: address CR comments
* feat: write to a shard or a shard builder
* feat: freeze and fork for partition and shards
* chore: shard builder
* chore: change dict reader to support random access
* test: test write shard
* test: test write
* test: test memtable
* feat: add new and write_row to DataParts
* refactor: partition freeze shards
* refactor: write_with_pk_id
* style: fix clippy
* chore: add methods to get pk weights
* chore: fix compiler errors
* feat: impl merge reader for DataParts
* fix: fmt
* fix: sort rows with pk and ts according to sequence desc
* fix: remove pk weight as pk indices are already replaced by weights
* fix: format
* fix: some cr comments
* fix: some cr comments
* refactor: simplify trait's associated types
* fix: some cr comments
* refactor: set the actual bound port so we can use port 0 in testing
* Update src/servers/src/server.rs
Co-authored-by: Weny Xu <wenymedia@gmail.com>
* fmt
---------
Co-authored-by: Weny Xu <wenymedia@gmail.com>
* fix: logical region can't find region routes
* feat: fetch partitions info in batch
* refactor: rename batch functions
* refactor: rename DdlTaskExecutor to ProcedureExecutor
* feat: impl migrate_region and query_procedure_state for ProcedureExecutor
* feat: adds SQL function procedure_state and finish migrate_region impl
* fix: constant vector
* feat: unit tests for migrate_region and procedure_state
* test: test region migration by SQL
* fix: compile error after rebasing
* fix: clippy warnings
* feat: ensure procedure_state and migrate_region can be only called under greptime catalog
* fix: license header
* feat: impl for ScalarExpr
* feat: plain functions
* refactor: simpler trait bound&tests
* chore: remove unused imports
* chore: fmt
* refactor: early ret on first error
* refactor: remove redundant match arm
* chore: per review
* doc: `support` fn
* chore: per review more
* chore: more per review
* fix: extract_bound
* chore: per review
* refactor: reduce nest
* feat: replace pk index with pk_weight during freeze
* chore: add parameter to control pk_index replacement
* fix: dedup pk weights also
* fix: generate pk array before dedup
* feat: data buffer and related structs
* fix: some cr comments
* chore: remove freeze_threshold in DataBuffer
* fix: use LazyMutableVectorBuilder instead of two vector; add option to control dedup
* fix: dedup rows according to both pk weights and timestamps
* fix: assemble DataBatch on demand
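Several of the items above deduplicate buffered rows by primary-key weight and timestamp while keeping the row with the highest sequence. A minimal, self-contained sketch of that idea follows; the types and field names are illustrative, not the engine's actual ones:

```rust
/// A row keyed by (pk_weight, timestamp); `sequence` decides which duplicate wins.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    pk_weight: u16,
    ts: i64,
    sequence: u64,
    value: f64,
}

/// Sort by (pk_weight, ts) ascending and sequence descending, then keep the
/// first row of each (pk_weight, ts) group, i.e. the one with the largest sequence.
fn dedup_rows(mut rows: Vec<Row>) -> Vec<Row> {
    rows.sort_by(|a, b| {
        (a.pk_weight, a.ts, std::cmp::Reverse(a.sequence))
            .cmp(&(b.pk_weight, b.ts, std::cmp::Reverse(b.sequence)))
    });
    rows.dedup_by(|next, kept| next.pk_weight == kept.pk_weight && next.ts == kept.ts);
    rows
}

fn main() {
    let rows = vec![
        Row { pk_weight: 1, ts: 10, sequence: 2, value: 2.0 },
        Row { pk_weight: 1, ts: 10, sequence: 5, value: 5.0 }, // newer duplicate wins
        Row { pk_weight: 0, ts: 10, sequence: 1, value: 1.0 },
    ];
    let deduped = dedup_rows(rows);
    assert_eq!(deduped.len(), 2);
    assert_eq!(deduped[1].sequence, 5);
    println!("{deduped:?}");
}
```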
* refactor: bring metrics to http output
* chore: remove unwrap
* chore: make walk plan accumulate
* chore: change field name and comment
* chore: add metrics to http resp header
* chore: move PrometheusJsonResponse to a separate file and impl IntoResponse
* chore: put metrics in prometheus resp header too
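The items above attach query metrics to the HTTP response as a header in addition to the JSON body. A small sketch of the idea using the `http` crate is shown below; the header name and JSON shape are assumptions for illustration only:

```rust
// Sketch only; assumes `http = "1"` in Cargo.toml. The header name below is illustrative.
use http::Response;

fn response_with_metrics(
    body: String,
    elapsed_ms: u128,
    rows: usize,
) -> http::Result<Response<String>> {
    // Serialize a tiny metrics summary; a real implementation would reuse the
    // execution metrics collected while running the plan.
    let metrics = format!(r#"{{"elapsed_ms":{elapsed_ms},"rows":{rows}}}"#);
    Response::builder()
        .header("content-type", "application/json")
        .header("x-greptime-metrics", metrics) // metrics ride along as a response header
        .body(body)
}

fn main() -> http::Result<()> {
    let resp = response_with_metrics(r#"{"output":[]}"#.to_string(), 12, 0)?;
    println!("{:?}", resp.headers().get("x-greptime-metrics"));
    Ok(())
}
```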
* fix(util): join_path function should not trim leading `/`
Signed-off-by: Hudson C. Dalpra <dalpra.hcd@gmail.com>
* fix(util): making required changes at join_path function
* fix(util): added unit tests to match function comments
---------
Signed-off-by: Hudson C. Dalpra <dalpra.hcd@gmail.com>
* chore: start plugins in standalone
* chore: respect current catalog in use statement for mysql
* chore: reduce unnecessary conversion to string
* chore: reduce duplicate code
* feat: add arrow format output for sql api
* refactor: remove unwraps
* test: add test for arrow format
* chore: update cargo toml format
* fix: resolve lint warnings
* fix: ensure outputs size is one
* feat: let TypeConversionRule aware query context timezone setting
* chore: don't optimize explain command
* feat: parse string into timestamp with timezone
* fix: compile error
* chore: check the scalar value type in predicate
* chore: remove mut for engine context
* chore: return none if the scalar value is utf8 in time range predicate
* fix: some fixme
* feat: let Date and DateTime parsing from string value be aware of timezone
* chore: tweak
* test: add datetime from_str test with timezone
* feat: construct function context from query context
* test: add timezone test for to_unixtime and date_format function
* fix: typo
* chore: apply suggestion
* test: adds string with timezone
* chore: apply CR suggestion
Co-authored-by: Lei, HUANG <6406592+v0y4g3r@users.noreply.github.com>
* chore: apply suggestion
---------
Co-authored-by: Lei, HUANG <6406592+v0y4g3r@users.noreply.github.com>
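The items above make string-to-timestamp parsing respect the query context's timezone rather than always assuming UTC. Here is a minimal chrono-based sketch of that behavior; the crate usage and format string are assumptions, and the project's own parsing code will differ:

```rust
// Sketch only; assumes `chrono = "0.4"` and `chrono-tz = "0.8"` in Cargo.toml.
use chrono::{NaiveDateTime, TimeZone};
use chrono_tz::Tz;

/// Parse a timestamp string that has no explicit offset, interpreting it in `tz`,
/// and return the corresponding Unix timestamp in milliseconds.
fn parse_in_timezone(s: &str, tz: Tz) -> Option<i64> {
    let naive = NaiveDateTime::parse_from_str(s, "%Y-%m-%d %H:%M:%S").ok()?;
    // `single()` rejects ambiguous or non-existent local times (DST transitions).
    let local = tz.from_local_datetime(&naive).single()?;
    Some(local.timestamp_millis())
}

fn main() {
    let utc = parse_in_timezone("2024-01-01 00:00:00", chrono_tz::UTC).unwrap();
    let shanghai = parse_in_timezone("2024-01-01 00:00:00", chrono_tz::Asia::Shanghai).unwrap();
    // The same wall-clock string is 8 hours earlier on the epoch axis when read as Asia/Shanghai.
    assert_eq!(utc - shanghai, 8 * 3600 * 1000);
    println!("utc={utc} shanghai={shanghai}");
}
```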
* feat(tests-fuzz): add CreateTableExprGenerator
* refactor: move Column to root of ir mod
* feat: add AlterTableExprGenerator
* feat: add Serialize and Deserialize derive
* chore: refactor the AlterExprGenerator
* feat: auto config cache and buffer size according to mem size
* feat: utils
* refactor: add util function to common config
* refactor: check cgroups
* refactor: code
* fix: test
* fix: test
* chore: cr comment
Co-authored-by: Yingwen <realevenyag@gmail.com>
Co-authored-by: Dennis Zhuang <killme2008@gmail.com>
* chore: remove default comment
---------
Co-authored-by: Yingwen <realevenyag@gmail.com>
Co-authored-by: Dennis Zhuang <killme2008@gmail.com>
* feat: adds date_format function
* fix: compile error
* chore: use system timezone for FunctionContext and EvalContext
* test: as_formatted_string
* test: sqlness test
* chore: rename function
* docs: Update README.md
Complying with ASF policy we should refer to Apache projects in their full form in the first and most prominent usage.
* Update README.md
* Update README.md
* test: add unit tests
* feat: introduce kafka runtime backed by testcontainers
* test: add test for kafka runtime
* fix: format
* chore: make kafka image ready to be used
* feat: add entry builder
* tmp
* test: add unit tests for client manager
* test: add some unit tests for kafka log store
* chore: resolve some todos
* chore: resolve some todos
* test: add unit tests for kafka log store
* chore: add deprecate develop branch warning
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
* tmp: ready to move unit tests to an indie dir
* test: update unit tests for client manager
* test: add unit tests for meta srv remote wal
* fix: license
* fix: test
* refactor: kafka image
* doc: add doc example for kafka image
* chore: migrate kafka image to an indie PR
* fix: CR
* fix: CR
* fix: test
* fix: CR
* fix: update Cargo.toml
* fix: CR
* feat: skip test if no endpoints env
* fix: format
* test: rewrite parallel test with barrier
---------
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
* feat: split an entry if it's too large
* chore: rewrite check records
* test: add some unit tests for record
* chore: rewrite entry splitting
* chore: add unit tests for build records
* chore: add more unit tests for record
* chore: rewrite encoding/decoding of record
* revert: ignored test
* fix: set limit for max_batch_size
* fix: clippy
* chore: remove heavy logging
* fix: CR
* fix: properly terminate
* fix: CR
* fix: compiling
* fix: sqlness
* fix: CR
* fix: license
* fix: license
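The "split an entry if it's too large" items above break one WAL entry into multiple records so that no single Kafka produce exceeds `max_batch_size`. A minimal sketch of that chunking follows; the record layout and names are illustrative only:

```rust
/// Illustrative record: which entry it belongs to, its position in the split, and a data chunk.
#[derive(Debug)]
struct Record {
    entry_id: u64,
    seq: usize,
    last: bool,
    data: Vec<u8>,
}

/// Split an entry's payload into records whose data chunks never exceed `max_batch_size` bytes.
fn split_entry(entry_id: u64, payload: &[u8], max_batch_size: usize) -> Vec<Record> {
    assert!(max_batch_size > 0);
    let chunks: Vec<&[u8]> = payload.chunks(max_batch_size).collect();
    let total = chunks.len().max(1);
    chunks
        .into_iter()
        .enumerate()
        .map(|(seq, chunk)| Record {
            entry_id,
            seq,
            last: seq + 1 == total,
            data: chunk.to_vec(),
        })
        .collect()
}

fn main() {
    let payload = vec![0u8; 2500];
    let records = split_entry(42, &payload, 1024);
    assert_eq!(records.len(), 3); // 1024 + 1024 + 452 bytes
    assert!(records.last().unwrap().last);
    println!("{} records", records.len());
}
```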
* refactor(metrics): add 'greptimedb_' prefix for every metrics
* chore: use 'greptime_' as prefix
* chore: add some prefix for new metrics
* chore: fix format error
* feat: implement `KeyRwLock`
* refactor: use KeyRwLock instead of LockMap
* refactor: use StringKey instead of String
* chore: remove redundant code
* refactor: cleanup KeyRwLock stale locks before granting new lock
* feat: clean stale locks manually
* feat: sort lock keys in lexicographical order
* feat: ensure the ref count before dropping the rwlock
* feat: add more tests for rwlock
* feat: drop the key guards first
* feat: drops the key guards in the reverse order
* chore: apply suggestions from CR
* chore: apply suggestions from CR
* chore: apply suggestions from CR
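The `KeyRwLock` items above grant read/write locks scoped to individual string keys (e.g. catalog or schema identifiers) instead of one global lock. Here is a minimal sketch of the idea built on `tokio::sync::RwLock`; it is a simplification under assumed names and does not clean up stale entries the way the items above describe:

```rust
// Sketch only; assumes `tokio = { version = "1", features = ["full"] }` in Cargo.toml.
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use tokio::sync::{OwnedRwLockReadGuard, OwnedRwLockWriteGuard, RwLock};

/// Hands out per-key read/write locks; callers using the same key contend on the
/// same underlying RwLock, while different keys proceed independently.
#[derive(Default)]
struct KeyRwLock {
    locks: Mutex<HashMap<String, Arc<RwLock<()>>>>,
}

impl KeyRwLock {
    fn lock_for(&self, key: &str) -> Arc<RwLock<()>> {
        let mut locks = self.locks.lock().unwrap();
        locks.entry(key.to_string()).or_default().clone()
    }

    async fn read(&self, key: &str) -> OwnedRwLockReadGuard<()> {
        self.lock_for(key).read_owned().await
    }

    async fn write(&self, key: &str) -> OwnedRwLockWriteGuard<()> {
        self.lock_for(key).write_owned().await
    }
}

#[tokio::main]
async fn main() {
    let locks = KeyRwLock::default();
    let _a = locks.read("catalog/schema_a").await;
    let _b = locks.write("catalog/schema_b").await; // different key: no contention
    println!("holding a read lock on schema_a and a write lock on schema_b");
}
```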
* fix: fix heartbeat handler ignore upgrade candidate instruction
* fix: fix handler did not inject wal options
* feat: expose `RegionMigrationProcedureTask`
* feat(tests-integration): add a naive region migration test
* chore: apply suggestions from CR
* feat: add test if the target region has migrated
* chore: apply suggestions from CR
* chore(tests-integration): add setup tests with kafka wal to README.md
* feat(tests-integration): add meta wal config
* fix(tests-integration): fix sign of both_instances_cases_with_kafka_wal
* chore(tests-integration): set num_topic to 3 for tests
* test(tests-integration): add a naive test with kafka wal
* chore: apply suggestions from CR
* refactor: use string type instead of Option type for '--store-key-prefix'
Signed-off-by: zyy17 <zyylsxm@gmail.com>
* chore: refine for code review comments
---------
Signed-off-by: zyy17 <zyylsxm@gmail.com>
* feat: remote write metric task
* chore: pass standalone task to frontend
* chore: change name to system metric
* fix: add header and rename to export metrics
* refactor: open regions in background
* feat: add status code of RegionNotReady
* feat: use RegionNotReady instead of RegionNotFound for a registering region
* chore: apply suggestions from CR
* feat: add status code of RegionBusy
* feat: return RegionBusy for mutually exclusive operations
* refactor: refactor wal config
* test: update tests related to wal
* feat: introduce kafka wal config
* chore: augment proto with wal options
* feat: augment region open request with wal options
* feat: augment mito region with wal options
* feat: augment region create request with wal options
* refactor: refactor log store trait
* feat: add skeleton for kafka log store
* feat: generalize building log store when starting datanode
* feat: integrate wal options to region write
* chore: minor update
* refactor: remove wal options from region create/open requests
* fix: compilation issues
* chore: insert wal options into region options upon initializing region server
* chore: integrate wal options into region options
* chore: fill in kafka wal config
* chore: reuse namespaces while writing to wal
* chore: minor update
* chore: fetch wal options from region while handling truncate/flush
* fix: region options test
* fix: resolve some review conversations
* refactor: serde with wal options
* fix: resolve some review conversations
* feat: introduce wal config and kafka config
* feat: introduce kafka topic manager and selector
* feat: introduce region wal options
* chore: build region wal options upon starting meta srv
* feat: integrate region wal options allocator into table meta allocator
* chore: add wal config to metasrv.example.toml
* chore: add region wal options map to create table procedure
* feat: augment region create request with wal options
* feat: augment DatanodeTableValue with region wal options map
* chore: encode region wal options upon constructing table creator
* feat: persist region wal options when creating table meta
* fix: sqlness test
* chore: set default wal provider to raft-engine
* refactor: refactor wal options
* chore: update wal options allocator
* refactor: rename region wal options to wal options
* chore: update usages of region wal options
* chore: add some comments to kafka
* chore: fill in kafka config
* test: add tests for serde wal config
* test: add tests for wal options
* refactor: refactor wal options allocator to enum
* refactor: store wal options into the request options instead
* fix: typo
* fix: typo
* refactor: move wal options map to region info
* refactor: refactor serialization and deserialization of wal options
* refactor: use serde_json to encode wal options
* chore: rename wal_options_map to region_wal_options
* chore: resolve some review comments
* fix: typo
* refactor: replace kebab-case with snake_case
* fix: sqlness and coverage tests
* fix: typo
* fix: coverage test
* fix: coverage test
* chore: resolve some review conversations
* fix: resolve some review conversations
* chore: format comments in metasrv.example.toml
* chore: update import style
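The items above model per-region WAL options as an enum serialized with `serde_json` and stored alongside the region options. A minimal sketch of such an enum under that assumption is shown below; the tag key, variant and field names are illustrative, not the project's actual types:

```rust
// Sketch only; assumes `serde = { version = "1", features = ["derive"] }` and `serde_json = "1"`.
use serde::{Deserialize, Serialize};

/// Illustrative per-region WAL options: either the local raft-engine WAL or a Kafka topic.
#[derive(Debug, PartialEq, Serialize, Deserialize)]
#[serde(tag = "wal.provider", rename_all = "snake_case")]
enum WalOptions {
    RaftEngine,
    Kafka { topic: String },
}

fn main() -> serde_json::Result<()> {
    let options = WalOptions::Kafka { topic: "greptimedb_wal_topic_0".to_string() };
    let encoded = serde_json::to_string(&options)?;
    println!("{encoded}"); // {"wal.provider":"kafka","topic":"greptimedb_wal_topic_0"}
    let decoded: WalOptions = serde_json::from_str(&encoded)?;
    assert_eq!(decoded, options);
    Ok(())
}
```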
* feat: integrate wal options allocator to standalone mode
* test: add compatible test for OpenRegion
* test: add compatible test for UpdateRegionMetadata
* chore: remove based suffix from topic selector type
* fix: typos and bit operation
* fix: helper
* chore: add tests in decimal128.rs and interval.rs
* chore: test
* chore: change proto version
* chore: clippy
* chore: change auth_fn to function and return response with json body
* chore: move unsupported to debug level
* chore: add docs and tests
* chore: rebase and update test
* feat: sql with influxdb v1 result format
* chore: add unit tests
* feat: minor refactor
* chore: by comment
* chore: u128 to u64 since serde can't deserialize u128 in enum
* chore: by comment
* chore: apply suggestion
* chore: revert suggestion
* chore: try again
---------
Co-authored-by: dennis zhuang <killme2008@gmail.com>
* fix: use linear interpolation to implement range LINEAR fill strategy
* chore: update test case
* chore: optimize linear interpolation implementation
* chore: update test and add comment
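The LINEAR fill strategy above estimates a missing point from its two nearest neighbors in time. A minimal sketch of the interpolation formula used is given below; it is standalone and not the query engine's actual code:

```rust
/// Linearly interpolate the value at time `t` between the known points
/// (t0, v0) and (t1, v1): v = v0 + (v1 - v0) * (t - t0) / (t1 - t0).
fn linear_fill(t0: i64, v0: f64, t1: i64, v1: f64, t: i64) -> f64 {
    debug_assert!(t0 < t1 && t0 <= t && t <= t1);
    let ratio = (t - t0) as f64 / (t1 - t0) as f64;
    v0 + (v1 - v0) * ratio
}

fn main() {
    // Halfway between (0, 10.0) and (10, 20.0) the filled value is 15.0.
    assert_eq!(linear_fill(0, 10.0, 10, 20.0, 5), 15.0);
    println!("{}", linear_fill(0, 10.0, 10, 20.0, 5));
}
```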
* feat: add build function and register it
build() function to return the database build info #2909
* refactor: fix typos and change code structure
* test: add test for build()
* refactor: cargo fmt and eliminate warnings
* Apply suggestions from code review
Co-authored-by: Weny Xu <wenymedia@gmail.com>
* refactor: move system.sql to a new directory
---------
Co-authored-by: Weny Xu <wenymedia@gmail.com>
* test: add flow test for open candidate region with retryable error
* test: add flow test for upgrade candidate retry failed
* test: add flow test for upgrade candidate with retry
* refactor: use downgrading the region instead of closing region
* feat: enhance the tests for alive keeper
* feat: add a metric to track region lease expired
* chore: apply suggestions from CR
* chore: enable logging for test_distributed_handle_ddl_request
* refactor: simplify lease keeper
* feat: add metrics for lease keeper
* chore: apply suggestions from CR
* chore: apply suggestions from CR
* chore: apply suggestions from CR
* refactor: move OpeningRegionKeeper to common_meta
* feat: register operating regions to MemoryRegionKeeper
* feat: add backward compatibility test for persistent ctx
* refactor: refactor State of region migration
* feat: add test utils for region migration tests
* test: add simple region migration tests
* chore: apply suggestions from CR
* feat: adds date_add and date_sub function
* test: add date function
* fix: adding an interval to a date returns wrong result
* fix: header
* fix: typo
* fix: timestamp resolution
* fix: capacity
* chore: apply suggestion
* fix: wrong behavior when adding intervals to timestamp, date and datetime
* chore: remove unused error
* test: refactor and add some tests
* chore: add logs and metrics
* feat: add the timer to track heartbeat interval
* feat: add the gauge to track region leases
* refactor: use gauge instead of the timer
* chore: apply suggestions from CR
* feat: add hit rate and etcd txn metrics
* feat: add random weighted choice in load_based selector
* fix: meta cannot save heartbeats when cluster has no region
* chore: print some log
* chore: remove unused code
* cr
* add some logs when filter result is empty
* feat: add page cache
* docs: update mito config toml
* feat: impl CachedPageReader
* feat: use cache reader to read row group
* feat: do not fetch data if we have pages in cache
* chore: return if nothing to fetch
* feat: enlarge page cache size
* test: test write read parquet
* test: test cache
* docs: update comments
* test: fix config api test
* feat: cache metrics
* feat: change default page cache size
* test: fix config api test
* feat: bump prost and fix pprof feature compiler errors
* feat: fix compiler errors on tokio-console
* chore: fix compiler errors
* ci: add all features check to ci
* feat: decrease the `page size` if the response message size exceeds the limit
* chore: apply suggestions from CR
* feat: prefer to use adaptive_page_size
* chore: apply suggestions from CR
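The adaptive paging items above shrink the page size when an encoded response would exceed the gRPC message size limit, so large rows still fit. A minimal sketch of that retry loop follows; the size-estimation closure stands in for real response encoding:

```rust
/// Repeatedly halve the page size until the encoded response fits under `max_message_size`,
/// or give up when the page can no longer shrink.
fn adaptive_page_size(
    mut page_size: usize,
    max_message_size: usize,
    encoded_size_of: impl Fn(usize) -> usize,
) -> Option<usize> {
    while page_size > 0 {
        if encoded_size_of(page_size) <= max_message_size {
            return Some(page_size);
        }
        page_size /= 2; // response too large: try a smaller page
    }
    None
}

fn main() {
    // Pretend every row encodes to roughly 1 KiB and the limit is 512 KiB.
    let fitted = adaptive_page_size(4096, 512 * 1024, |rows| rows * 1024);
    assert_eq!(fitted, Some(512));
    println!("{fitted:?}");
}
```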
* feat: Control merge reader by batch size
* test: test heap with large range
* fix: merge one batch
* test: merge many duplicates
* test: test reheap hot
* feat: don't handle empty batch in merge reader
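The merge-reader items above pull sorted input through a binary heap while capping how many rows each output batch carries. Below is a minimal sketch of batch-size-bounded k-way merging over plain integer streams, heavily simplified relative to the real reader:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge several ascending sequences, emitting output batches of at most `batch_size` rows.
fn merge_with_batch_size(sources: Vec<Vec<i64>>, batch_size: usize) -> Vec<Vec<i64>> {
    // Min-heap of (next value, source index, position within that source).
    let mut heap = BinaryHeap::new();
    for (i, src) in sources.iter().enumerate() {
        if let Some(&v) = src.first() {
            heap.push(Reverse((v, i, 0usize)));
        }
    }

    let mut batches = Vec::new();
    let mut current = Vec::with_capacity(batch_size);
    while let Some(Reverse((v, i, pos))) = heap.pop() {
        current.push(v);
        if current.len() == batch_size {
            batches.push(std::mem::take(&mut current)); // cut the batch here
        }
        if let Some(&next) = sources[i].get(pos + 1) {
            heap.push(Reverse((next, i, pos + 1)));
        }
    }
    if !current.is_empty() {
        batches.push(current);
    }
    batches
}

fn main() {
    let merged = merge_with_batch_size(vec![vec![1, 4, 7], vec![2, 5], vec![3, 6]], 4);
    assert_eq!(merged, vec![vec![1, 2, 3, 4], vec![5, 6, 7]]);
    println!("{merged:?}");
}
```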
Thanks a lot for considering contributing to GreptimeDB. We believe people like you would make GreptimeDB a great product. We intend to build a community where individuals can have open talks, show respect for one another, and speak with true ❤️. Meanwhile, we are to keep transparency and make your effort count here.
Please read the guidelines, and they can help you get started. Communicate with respect to developers maintaining and developing the project. In return, they should reciprocate that respect by addressing your issue, reviewing changes, as well as helping finalize and merge your pull requests.
Follow our [README](https://github.com/GreptimeTeam/greptimedb#readme) to get the whole picture of the project. To learn about the design of GreptimeDB, please refer to the [design docs](https://github.com/GrepTimeTeam/docs).
It can feel intimidating to contribute to a complex project, but it can also be exciting and fun. These general notes will help everyone participate in this communal activity.
- Follow the [Code of Conduct](https://github.com/GreptimeTeam/greptimedb/blob/main/CODE_OF_CONDUCT.md)
- Small changes make huge differences. We will happily accept a PR making a single character change if it helps move forward. Don't wait to have everything working.
- Check the closed issues before opening your issue.
- Try to follow the existing style of the code.
Pull requests are great, but we accept all kinds of other help if you like, such as:
- Write tutorials or blog posts. Blog, speak about, or create tutorials about one of GreptimeDB's many features. Mention [@greptime](https://twitter.com/greptime) on Twitter and email info@greptime.com so we can give pointers and tips and help you spread the word by promoting your content on Greptime communication channels.
- Improve the documentation. [Submit documentation](http://github.com/greptimeTeam/docs/) updates, enhancements, designs, or bug fixes, and fixing any spelling or grammar errors will be very much appreciated.
- Present at meetups and conferences about your GreptimeDB projects. Your unique challenges and successes in building things with GreptimeDB can provide great speaking material. We'd love to review your talk abstract, so get in touch with us if you'd like some help!
- Submitting bug reports. To report a bug or a security issue, you can [open a new GitHub issue](https://github.com/GrepTimeTeam/greptimedb/issues/new).
- Speak up about feature requests. Sending feedback is a great way for us to understand your different use cases of GreptimeDB better. If you want to share your experience with GreptimeDB, or if you want to discuss any ideas, you can start a discussion on [GitHub discussions](https://github.com/GreptimeTeam/greptimedb/discussions), chat with the Greptime team on [Slack](https://greptime.com/slack), or you can tweet [@greptime](https://twitter.com/greptime) on Twitter.
## Code of Conduct
Also, there are things that we are not looking for because they don't match the goals of the product or benefit the community. Please read [Code of Conduct](https://github.com/GreptimeTeam/greptimedb/blob/main/CODE_OF_CONDUCT.md); we hope everyone can keep good manners and become an honored member.
## License
GreptimeDB uses the Apache 2.0 license.
### Before PR
- To ensure that the community is free and confident in its ability to use your contributions, please sign the Contributor License Agreement (CLA) which will be incorporated in the pull request process.
- Make sure all files have proper license header (running `docker run --rm -v $(pwd):/github/workspace ghcr.io/korandoru/hawkeye-native:v3 format` from the project root).
- Make sure all your code is formatted and follows the [coding style](https://pingcap.github.io/style-guide/rust/) and [style guide](http://github.com/greptimeTeam/docs/style-guide.md).
- Make sure all unit tests pass (using `cargo test --workspace` or [nextest](https://nexte.st/index.html) `cargo nextest run`).
- Make sure all clippy warnings are fixed (you can check it locally by running `cargo clippy --workspace --all-targets -- -D warnings`).
Now, `pre-commit` will run automatically on `git commit`.
### Title
The titles of pull requests should be prefixed with category names listed in [Conventional Commits specification](https://www.conventionalcommits.org/en/v1.0.0)
like `feat`/`fix`/`docs`, with a concise summary of code change following. AVOID using the last commit message as pull request title.
### Description
## Community
The core team will be thrilled if you would like to participate in any way you like. When you are stuck, try to ask for help by filing an issue, with a detailed description of what you were trying to do and what went wrong. If you have any questions or if you would like to get involved in our community, please check out:
- [GreptimeDB Community Slack](https://greptime.com/slack)
**GreptimeDB** is an open-source time-series database focusing on efficiency, scalability, and analytical capabilities.
Designed to work on infrastructure of the cloud era, GreptimeDB benefits users with its elasticity and commodity storage, offering a fast and cost-effective **alternative to InfluxDB** and a **long-term storage for Prometheus**.
## Why GreptimeDB
Seamless scalability from a standalone binary at edge to a robust, highly available distributed cluster in cloud, with a transparent experience for both developers and administrators.
* **Analyzing time-series data**
Query your time-series data with SQL and PromQL. Use Python scripts to facilitate complex analytical tasks.
* **Cloud-native distributed database**
Fully open-source distributed cluster architecture that harnesses the power of cloud-native elastic computing resources.
* **Performance and Cost-effective**
Flexible indexing capabilities and distributed, parallel-processing query engine, tackling high cardinality issues down. Optimized columnar layout for handling time-series data; compacted, compressed, and stored on various storage backends, particularly cloud object storage with 50x cost efficiency.
* **Compatible with InfluxDB, Prometheus and more protocols**
Widely adopted database protocols and APIs, including MySQL, PostgreSQL, and Prometheus Remote Storage, etc. [Read more](https://docs.greptime.com/user-guide/clients/overview).
- C/C++ Toolchain: provides basic tools for compiling and linking. This is available as `build-essential` on Ubuntu and under similar names on other platforms.
- Rust: the easiest way to install Rust is to use [`rustup`](https://rustup.rs/), which will check our `rust-toolchain` file and install the correct Rust version for you.
- Protobuf: `protoc` is required for compiling `.proto` files. `protobuf` is available from major package managers on macOS and Linux distributions. You can find installation instructions [here](https://grpc.io/docs/protoc-installation/). **Note that the `protoc` version needs to be >= 3.15** because we use the `optional` keyword. You can check it with `protoc --version`.
- python3-dev or python3-devel (optional; only needed if you want to run scripts in CPython, which also requires enabling the `pyo3_backend` feature when compiling, either via `cargo run -F pyo3_backend` or by adding `pyo3_backend` to `features.default` in src/script/Cargo.toml, e.g. `default = ["python", "pyo3_backend"]`): this installs the Python shared library required for running the Python scripting engine in CPython mode. It is available as `python3-dev` on Ubuntu (install with `sudo apt install python3-dev`) or `python3-devel` on RPM-based distributions (e.g. Fedora, Red Hat, SuSE). macOS's `Python3` package should include this shared library by default. More detail on compiling with PyO3 can be found in [PyO3](https://pyo3.rs/v0.18.1/building_and_distribution#configuring-the-python-version)'s documentation.
To install GreptimeDB locally, the recommended way is via Docker:
#### Build with Docker
A docker image with necessary dependencies is provided:
* Python toolchain (optional): Required only if built with PyO3 backend. More detail for compiling with PyO3 can be found in its [documentation](https://pyo3.rs/v0.18.1/building_and_distribution#configuring-the-python-version).
Build GreptimeDB binary:
```shell
make
```
Run a standalone server:
```shell
cargo run -- standalone start
```
Or if you built from docker:
```shell
docker run -p 4002:4002 -v "$(pwd):/tmp/greptimedb" greptime/greptimedb standalone start
```
Please see the online document site for more installation options and [operations info](https://docs.greptime.com/user-guide/operations/overview).
### Get started
Read the [complete getting started guide](https://docs.greptime.com/getting-started/try-out-greptimedb) on our [official document site](https://docs.greptime.com/).
To write and query data, GreptimeDB is compatible with multiple [protocols and clients](https://docs.greptime.com/user-guide/clients/overview).
For Linux and macOS, you can easily download pre-built binaries including official releases and nightly builds that are ready to use.
In most cases, downloading the version without PyO3 is sufficient. However, if you plan to run scripts in CPython (and use Python packages like NumPy and Pandas), you will need to download the version with PyO3 and install a Python with the same version as the Python in the PyO3 version.
We recommend using virtualenv for the installation process to manage multiple Python versions.
Please refer to [contribution guidelines](CONTRIBUTING.md) and [internal concepts docs](https://docs.greptime.com/contributor-guide/overview.html) for more information.
## Acknowledgement
- GreptimeDB uses [Apache Arrow™](https://arrow.apache.org/) as the memory model and [Apache Parquet™](https://parquet.apache.org/) as the persistent file format.
- GreptimeDB's query engine is powered by [Apache Arrow DataFusion™](https://arrow.apache.org/datafusion/).
- [Apache OpenDAL™](https://opendal.apache.org) gives GreptimeDB a very general and elegant data access abstraction layer.
- GreptimeDB's meta service is based on [etcd](https://etcd.io/).
- GreptimeDB uses [RustPython](https://github.com/RustPython/RustPython) for experimental embedded python scripting.
The wal benchmarker serves to evaluate the performance of GreptimeDB's Write-Ahead Log (WAL) component. It meticulously assesses the read/write performance of the WAL under diverse workloads generated by the benchmarker.
### How to use
To compile the benchmarker, navigate to the `greptimedb/benchmarks` directory and execute `cargo build --release`. Subsequently, you'll find the compiled target located at `greptimedb/target/release/wal_bench`.
The `./wal_bench -h` command reveals numerous arguments that the target accepts. Among these, a notable one is the `cfg-file` argument. By utilizing a configuration file in the TOML format, you can bypass the need to repeatedly specify cumbersome arguments.
| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `prom_store.enable` | Bool | `true` | Whether to enable Prometheus remote write and read in HTTP API. |
| `prom_store.with_metric_engine` | Bool | `true` | Whether to store the data from Prometheus remote write in metric engine. |
| `wal` | -- | -- | The WAL options. |
| `wal.provider` | String | `raft_engine` | The provider of the WAL.<br/>- `raft_engine`: the wal is stored in the local file system by raft-engine.<br/>- `kafka`: it's remote wal that data is stored in Kafka. |
| `wal.dir` | String | `None` | The directory to store the WAL files.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.file_size` | String | `256MB` | The size of the WAL segment file.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.purge_threshold` | String | `4GB` | The threshold of the WAL size to trigger a flush.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.purge_interval` | String | `10m` | The interval to trigger a flush.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.read_batch_size` | Integer | `128` | The read batch size.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.sync_write` | Bool | `false` | Whether to use sync write.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.enable_log_recycle` | Bool | `true` | Whether to reuse logically truncated log files.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.prefill_log_files` | Bool | `false` | Whether to pre-create log files on start up.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.sync_period` | String | `10s` | Duration for fsyncing log files.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.broker_endpoints` | Array | -- | The Kafka broker endpoints.<br/>**It's only used when the provider is `kafka`**. |
| `wal.max_batch_size` | String | `1MB` | The max size of a single producer batch.<br/>Warning: Kafka has a default limit of 1MB per message in a topic.<br/>**It's only used when the provider is `kafka`**. |
| `wal.linger` | String | `200ms` | The linger duration of a kafka batch producer.<br/>**It's only used when the provider is `kafka`**. |
| `wal.consumer_wait_timeout` | String | `100ms` | The consumer wait timeout.<br/>**It's only used when the provider is `kafka`**. |
| `wal.backoff_init` | String | `500ms` | The initial backoff delay.<br/>**It's only used when the provider is `kafka`**. |
| `wal.backoff_max` | String | `10s` | The maximum backoff delay.<br/>**It's only used when the provider is `kafka`**. |
| `wal.backoff_base` | Integer | `2` | The exponential backoff rate, i.e. next backoff = base * current backoff.<br/>**It's only used when the provider is `kafka`**. |
| `wal.backoff_deadline` | String | `5mins` | The deadline of retries.<br/>**It's only used when the provider is `kafka`**. |
| `storage` | -- | -- | The data storage options. |
| `storage.data_home` | String | `/tmp/greptimedb/` | The working home directory. |
| `storage.type` | String | `File` | The storage type used to store the data.<br/>- `File`: the data is stored in the local file system.<br/>- `S3`: the data is stored in the S3 object storage.<br/>- `Gcs`: the data is stored in the Google Cloud Storage.<br/>- `Azblob`: the data is stored in the Azure Blob Storage.<br/>- `Oss`: the data is stored in the Aliyun OSS. |
| `storage.cache_path` | String | `None` | Cache configuration for object storage such as 'S3' etc.<br/>The local file cache directory. |
| `storage.cache_capacity` | String | `None` | The local file cache capacity in bytes. |
| `storage.bucket` | String | `None` | The S3 bucket name.<br/>**It's only used when the storage type is `S3`, `Oss` and `Gcs`**. |
| `storage.root` | String | `None` | The S3 data will be stored in the specified prefix, for example, `s3://${bucket}/${root}`.<br/>**It's only used when the storage type is `S3`, `Oss` and `Azblob`**. |
| `storage.access_key_id` | String | `None` | The access key id of the aws account.<br/>It's **highly recommended** to use AWS IAM roles instead of hardcoding the access key id and secret key.<br/>**It's only used when the storage type is `S3` and `Oss`**. |
| `storage.secret_access_key` | String | `None` | The secret access key of the aws account.<br/>It's **highly recommended** to use AWS IAM roles instead of hardcoding the access key id and secret key.<br/>**It's only used when the storage type is `S3`**. |
| `storage.access_key_secret` | String | `None` | The secret access key of the aliyun account.<br/>**It's only used when the storage type is `Oss`**. |
| `storage.account_name` | String | `None` | The account key of the azure account.<br/>**It's only used when the storage type is `Azblob`**. |
| `storage.account_key` | String | `None` | The account key of the azure account.<br/>**It's only used when the storage type is `Azblob`**. |
| `storage.scope` | String | `None` | The scope of the google cloud storage.<br/>**It's only used when the storage type is `Gcs`**. |
| `storage.credential_path` | String | `None` | The credential path of the google cloud storage.<br/>**It's only used when the storage type is `Gcs`**. |
| `storage.container` | String | `None` | The container of the azure account.<br/>**It's only used when the storage type is `Azblob`**. |
| `storage.sas_token` | String | `None` | The sas token of the azure account.<br/>**It's only used when the storage type is `Azblob`**. |
| `storage.endpoint` | String | `None` | The endpoint of the S3 service.<br/>**It's only used when the storage type is `S3`, `Oss`, `Gcs` and `Azblob`**. |
| `storage.region` | String | `None` | The region of the S3 service.<br/>**It's only used when the storage type is `S3`, `Oss`, `Gcs` and `Azblob`**. |
| `[[region_engine]]` | -- | -- | The region engine options. You can configure multiple region engines. |
| `region_engine.mito.num_workers` | Integer | `8` | Number of region workers. |
| `region_engine.mito.worker_channel_size` | Integer | `128` | Request channel size of each worker. |
| `region_engine.mito.worker_request_batch_size` | Integer | `64` | Max batch size for a worker to handle requests. |
| `region_engine.mito.manifest_checkpoint_distance` | Integer | `10` | Number of meta action updated to trigger a new checkpoint for the manifest. |
| `region_engine.mito.compress_manifest` | Bool | `false` | Whether to compress manifest and checkpoint file by gzip (default false). |
| `region_engine.mito.max_background_jobs` | Integer | `4` | Max number of running background jobs |
| `region_engine.mito.auto_flush_interval` | String | `1h` | Interval to auto flush a region if it has not flushed yet. |
| `region_engine.mito.global_write_buffer_size` | String | `1GB` | Global write buffer size for all regions. If not set, it's default to 1/8 of OS memory with a max limitation of 1GB. |
| `region_engine.mito.global_write_buffer_reject_size` | String | `2GB` | Global write buffer size threshold to reject write requests. If not set, it's default to 2 times of `global_write_buffer_size` |
| `region_engine.mito.sst_meta_cache_size` | String | `128MB` | Cache size for SST metadata. Setting it to 0 to disable the cache.<br/>If not set, it's default to 1/32 of OS memory with a max limitation of 128MB. |
| `region_engine.mito.vector_cache_size` | String | `512MB` | Cache size for vectors and arrow arrays. Setting it to 0 to disable the cache.<br/>If not set, it's default to 1/16 of OS memory with a max limitation of 512MB. |
| `region_engine.mito.page_cache_size` | String | `512MB` | Cache size for pages of SST row groups. Setting it to 0 to disable the cache.<br/>If not set, it's default to 1/16 of OS memory with a max limitation of 512MB. |
| `region_engine.mito.scan_parallelism` | Integer | `0` | Parallelism to scan a region (default: 1/4 of cpu cores).<br/>- `0`: using the default value (1/4 of cpu cores).<br/>- `1`: scan in current thread.<br/>- `n`: scan in parallelism n. |
| `region_engine.mito.parallel_scan_channel_size` | Integer | `32` | Capacity of the channel to send data from parallel scan tasks to the main task. |
| `region_engine.mito.allow_stale_entries` | Bool | `false` | Whether to allow stale WAL entries read during replay. |
| `region_engine.mito.inverted_index` | -- | -- | The options for inverted index in Mito engine. |
| `region_engine.mito.inverted_index.create_on_flush` | String | `auto` | Whether to create the index on flush.<br/>- `auto`: automatically<br/>- `disable`: never |
| `region_engine.mito.inverted_index.create_on_compaction` | String | `auto` | Whether to create the index on compaction.<br/>- `auto`: automatically<br/>- `disable`: never |
| `region_engine.mito.inverted_index.apply_on_query` | String | `auto` | Whether to apply the index on query<br/>- `auto`: automatically<br/>- `disable`: never |
| `region_engine.mito.inverted_index.mem_threshold_on_create` | String | `64M` | Memory threshold for performing an external sort during index creation.<br/>Setting to empty will disable external sorting, forcing all sorting operations to happen in memory. |
| `region_engine.mito.inverted_index.intermediate_path` | String | `""` | File system path to store intermediate files for external sorting (default `{data_home}/index_intermediate`). |
| `region_engine.mito.memtable.index_max_keys_per_shard` | Integer | `8192` | The max number of keys in one shard.<br/>Only available for `partition_tree` memtable. |
| `region_engine.mito.memtable.data_freeze_threshold` | Integer | `32768` | The max rows of data inside the actively writing buffer in one shard.<br/>Only available for `partition_tree` memtable. |
| `region_engine.mito.memtable.fork_dictionary_bytes` | String | `1GiB` | Max dictionary bytes.<br/>Only available for `partition_tree` memtable. |
| `logging` | -- | -- | The logging options. |
| `logging.dir` | String | `/tmp/greptimedb/logs` | The directory to store the log files. |
| `logging.level` | String | `None` | The log level. Can be `info`/`debug`/`warn`/`error`. |
| `logging.append_stdout` | Bool | `true` | Whether to append logs to stdout. |
| `logging.tracing_sample_ratio` | -- | -- | The percentage of tracing will be sampled and exported.<br/>Valid range `[0, 1]`, 1 means all traces are sampled, 0 means all traces are not sampled, the default value is 1.<br/>ratio > 1 are treated as 1. Fractions < 0 are treated as 0. |
| `export_metrics` | -- | -- | The datanode can export its metrics and send to Prometheus compatible service (e.g. send to `greptimedb` itself) from remote-write API.<br/>This is only used for `greptimedb` to export its own metrics internally. It's different from prometheus scrape. |
| `export_metrics.remote_write.url` | String | `""` | The url the metrics send to. The url example can be: `http://127.0.0.1:4000/v1/prometheus/write?db=information_schema`. |
| `logging.append_stdout` | Bool | `true` | Whether to append logs to stdout. |
| `logging.tracing_sample_ratio` | -- | -- | The percentage of tracing will be sampled and exported.<br/>Valid range `[0, 1]`, 1 means all traces are sampled, 0 means all traces are not sampled, the default value is 1.<br/>ratio > 1 are treated as 1. Fractions < 0 are treated as 0. |
| `export_metrics` | -- | -- | The datanode can export its metrics and send to Prometheus compatible service (e.g. send to `greptimedb` itself) from remote-write API.<br/>This is only used for `greptimedb` to export its own metrics internally. It's different from prometheus scrape. |
| `export_metrics.remote_write.url` | String | `""` | The url the metrics send to. The url example can be: `http://127.0.0.1:4000/v1/prometheus/write?db=information_schema`. |
| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `data_home` | String | `/tmp/metasrv/` | The working home directory. |
| `bind_addr` | String | `127.0.0.1:3002` | The bind address of metasrv. |
| `server_addr` | String | `127.0.0.1:3002` | The communication server address for frontend and datanode to connect to metasrv, "127.0.0.1:3002" by default for localhost. |
| `procedure.max_metadata_value_size` | String | `1500KiB` | Auto split large value.<br/>GreptimeDB procedure uses etcd as the default metadata storage backend.<br/>The maximum size of any etcd request is 1.5 MiB.<br/>1500KiB = 1536KiB (1.5MiB) - 36KiB (reserved size of key).<br/>Comment out `max_metadata_value_size` to disable splitting large values (no limit). |
| `wal.topic_name_prefix` | String | `greptimedb_wal_topic` | A Kafka topic is constructed by concatenating `topic_name_prefix` and `topic_id`. |
| `wal.replication_factor` | Integer | `1` | Expected number of replicas of each partition. |
| `wal.create_topic_timeout` | String | `30s` | Above which a topic creation operation will be cancelled. |
| `wal.backoff_init` | String | `500ms` | The initial backoff for kafka clients. |
| `wal.backoff_max` | String | `10s` | The maximum backoff for kafka clients. |
| `wal.backoff_base` | Integer | `2` | Exponential backoff rate, i.e. next backoff = base * current backoff. |
| `wal.backoff_deadline` | String | `5mins` | Stop reconnecting if the total wait time reaches the deadline. If this config is missing, the reconnecting won't terminate. |
| `logging` | -- | -- | The logging options. |
| `logging.dir` | String | `/tmp/greptimedb/logs` | The directory to store the log files. |
| `logging.level` | String | `None` | The log level. Can be `info`/`debug`/`warn`/`error`. |
| `logging.append_stdout` | Bool | `true` | Whether to append logs to stdout. |
| `logging.tracing_sample_ratio` | -- | -- | The percentage of tracing will be sampled and exported.<br/>Valid range `[0, 1]`, 1 means all traces are sampled, 0 means all traces are not sampled, the default value is 1.<br/>ratio > 1 are treated as 1. Fractions < 0 are treated as 0. |
| `export_metrics` | -- | -- | The datanode can export its metrics and send to Prometheus compatible service (e.g. send to `greptimedb` itself) from remote-write API.<br/>This is only used for `greptimedb` to export its own metrics internally. It's different from prometheus scrape. |
| `export_metrics.remote_write.url` | String | `""` | The url the metrics send to. The url example can be: `http://127.0.0.1:4000/v1/prometheus/write?db=information_schema`. |
| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `mode` | String | `standalone` | The running mode of the datanode. It can be `standalone` or `distributed`. |
| `node_id` | Integer | `None` | The datanode identifier and should be unique in the cluster. |
| `require_lease_before_startup` | Bool | `false` | Start services after regions have obtained leases.<br/>It will block the datanode start if it can't receive leases in the heartbeat from metasrv. |
| `init_regions_in_background` | Bool | `false` | Initialize all regions in the background during the startup.<br/>By default, it provides services after all regions have been initialized. |
| `rpc_addr` | String | `127.0.0.1:3001` | The gRPC address of the datanode. |
| `rpc_hostname` | String | `None` | The hostname of the datanode. |
| `rpc_runtime_size` | Integer | `8` | The number of gRPC server worker threads. |
| `rpc_max_recv_message_size` | String | `512MB` | The maximum receive message size for gRPC server. |
| `rpc_max_send_message_size` | String | `512MB` | The maximum send message size for gRPC server. |
| `wal.provider` | String | `raft_engine` | The provider of the WAL.<br/>- `raft_engine`: the wal is stored in the local file system by raft-engine.<br/>- `kafka`: it's remote wal that data is stored in Kafka. |
| `wal.dir` | String | `None` | The directory to store the WAL files.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.file_size` | String | `256MB` | The size of the WAL segment file.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.purge_threshold` | String | `4GB` | The threshold of the WAL size to trigger a flush.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.purge_interval` | String | `10m` | The interval to trigger a flush.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.read_batch_size` | Integer | `128` | The read batch size.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.sync_write` | Bool | `false` | Whether to use sync write.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.enable_log_recycle` | Bool | `true` | Whether to reuse logically truncated log files.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.prefill_log_files` | Bool | `false` | Whether to pre-create log files on start up.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.sync_period` | String | `10s` | Duration for fsyncing log files.<br/>**It's only used when the provider is `raft_engine`**. |
| `wal.broker_endpoints` | Array | -- | The Kafka broker endpoints.<br/>**It's only used when the provider is `kafka`**. |
| `wal.max_batch_size` | String | `1MB` | The max size of a single producer batch.<br/>Warning: Kafka has a default limit of 1MB per message in a topic.<br/>**It's only used when the provider is `kafka`**. |
| `wal.linger` | String | `200ms` | The linger duration of a kafka batch producer.<br/>**It's only used when the provider is `kafka`**. |
| `wal.consumer_wait_timeout` | String | `100ms` | The consumer wait timeout.<br/>**It's only used when the provider is `kafka`**. |
| `wal.backoff_init` | String | `500ms` | The initial backoff delay.<br/>**It's only used when the provider is `kafka`**. |
| `wal.backoff_max` | String | `10s` | The maximum backoff delay.<br/>**It's only used when the provider is `kafka`**. |
| `wal.backoff_base` | Integer | `2` | The exponential backoff rate, i.e. next backoff = base * current backoff.<br/>**It's only used when the provider is `kafka`**. |
| `wal.backoff_deadline` | String | `5mins` | The deadline of retries.<br/>**It's only used when the provider is `kafka`**. |
| `storage` | -- | -- | The data storage options. |
| `storage.data_home` | String | `/tmp/greptimedb/` | The working home directory. |
| `storage.type` | String | `File` | The storage type used to store the data.<br/>- `File`: the data is stored in the local file system.<br/>- `S3`: the data is stored in the S3 object storage.<br/>- `Gcs`: the data is stored in the Google Cloud Storage.<br/>- `Azblob`: the data is stored in the Azure Blob Storage.<br/>- `Oss`: the data is stored in the Aliyun OSS. |
| `storage.cache_path` | String | `None` | Cache configuration for object storage such as 'S3' etc.<br/>The local file cache directory. |
| `storage.cache_capacity` | String | `None` | The local file cache capacity in bytes. |
| `storage.bucket` | String | `None` | The bucket name.<br/>**It's only used when the storage type is `S3`, `Oss` and `Gcs`**. |
| `storage.root` | String | `None` | The data will be stored in the specified prefix, for example, `s3://${bucket}/${root}`.<br/>**It's only used when the storage type is `S3`, `Oss` and `Azblob`**. |
| `storage.access_key_id` | String | `None` | The access key ID of the AWS account.<br/>It's **highly recommended** to use AWS IAM roles instead of hardcoding the access key ID and secret key.<br/>**It's only used when the storage type is `S3` and `Oss`**. |
| `storage.secret_access_key` | String | `None` | The secret access key of the AWS account.<br/>It's **highly recommended** to use AWS IAM roles instead of hardcoding the access key ID and secret key.<br/>**It's only used when the storage type is `S3`**. |
| `storage.access_key_secret` | String | `None` | The secret access key of the Aliyun account.<br/>**It's only used when the storage type is `Oss`**. |
| `storage.account_name` | String | `None` | The account name of the Azure account.<br/>**It's only used when the storage type is `Azblob`**. |
| `storage.account_key` | String | `None` | The account key of the Azure account.<br/>**It's only used when the storage type is `Azblob`**. |
| `storage.scope` | String | `None` | The scope of the Google Cloud Storage.<br/>**It's only used when the storage type is `Gcs`**. |
| `storage.credential_path` | String | `None` | The credential path of the Google Cloud Storage.<br/>**It's only used when the storage type is `Gcs`**. |
| `storage.container` | String | `None` | The container of the Azure account.<br/>**It's only used when the storage type is `Azblob`**. |
| `storage.sas_token` | String | `None` | The SAS token of the Azure account.<br/>**It's only used when the storage type is `Azblob`**. |
| `storage.endpoint` | String | `None` | The endpoint of the object storage service.<br/>**It's only used when the storage type is `S3`, `Oss`, `Gcs` and `Azblob`**. |
| `storage.region` | String | `None` | The region of the object storage service.<br/>**It's only used when the storage type is `S3`, `Oss`, `Gcs` and `Azblob`**. |
| `[[region_engine]]` | -- | -- | The region engine options. You can configure multiple region engines. |
| `region_engine.mito.num_workers` | Integer | `8` | Number of region workers. |
| `region_engine.mito.worker_channel_size` | Integer | `128` | Request channel size of each worker. |
| `region_engine.mito.worker_request_batch_size` | Integer | `64` | Max batch size for a worker to handle requests. |
| `region_engine.mito.manifest_checkpoint_distance` | Integer | `10` | Number of meta actions to trigger a new checkpoint for the manifest. |
| `region_engine.mito.compress_manifest` | Bool | `false` | Whether to compress manifest and checkpoint file by gzip (default false). |
| `region_engine.mito.max_background_jobs` | Integer | `4` | Max number of running background jobs |
| `region_engine.mito.auto_flush_interval` | String | `1h` | Interval to auto flush a region if it has not flushed yet. |
| `region_engine.mito.global_write_buffer_size` | String | `1GB` | Global write buffer size for all regions. If not set, it defaults to 1/8 of OS memory with a max limitation of 1GB. |
| `region_engine.mito.global_write_buffer_reject_size` | String | `2GB` | Global write buffer size threshold to reject write requests. If not set, it defaults to 2 times `global_write_buffer_size`. |
| `region_engine.mito.sst_meta_cache_size` | String | `128MB` | Cache size for SST metadata. Setting it to `0` disables the cache.<br/>If not set, it defaults to 1/32 of OS memory with a max limitation of 128MB. |
| `region_engine.mito.vector_cache_size` | String | `512MB` | Cache size for vectors and arrow arrays. Setting it to `0` disables the cache.<br/>If not set, it defaults to 1/16 of OS memory with a max limitation of 512MB. |
| `region_engine.mito.page_cache_size` | String | `512MB` | Cache size for pages of SST row groups. Setting it to `0` disables the cache.<br/>If not set, it defaults to 1/16 of OS memory with a max limitation of 512MB. |
| `region_engine.mito.scan_parallelism` | Integer | `0` | Parallelism to scan a region (default: 1/4 of cpu cores).<br/>- `0`: using the default value (1/4 of cpu cores).<br/>- `1`: scan in current thread.<br/>- `n`: scan in parallelism n. |
| `region_engine.mito.parallel_scan_channel_size` | Integer | `32` | Capacity of the channel to send data from parallel scan tasks to the main task. |
| `region_engine.mito.allow_stale_entries` | Bool | `false` | Whether to allow stale WAL entries read during replay. |
| `region_engine.mito.inverted_index` | -- | -- | The options for inverted index in Mito engine. |
| `region_engine.mito.inverted_index.create_on_flush` | String | `auto` | Whether to create the index on flush.<br/>- `auto`: automatically<br/>- `disable`: never |
| `region_engine.mito.inverted_index.create_on_compaction` | String | `auto` | Whether to create the index on compaction.<br/>- `auto`: automatically<br/>- `disable`: never |
| `region_engine.mito.inverted_index.apply_on_query` | String | `auto` | Whether to apply the index on query<br/>- `auto`: automatically<br/>- `disable`: never |
| `region_engine.mito.inverted_index.mem_threshold_on_create` | String | `64M` | Memory threshold for performing an external sort during index creation.<br/>Setting to empty will disable external sorting, forcing all sorting operations to happen in memory. |
| `region_engine.mito.inverted_index.intermediate_path` | String | `""` | File system path to store intermediate files for external sorting (default `{data_home}/index_intermediate`). |
| `region_engine.mito.memtable.index_max_keys_per_shard` | Integer | `8192` | The max number of keys in one shard.<br/>Only available for `partition_tree` memtable. |
| `region_engine.mito.memtable.data_freeze_threshold` | Integer | `32768` | The max rows of data inside the actively writing buffer in one shard.<br/>Only available for `partition_tree` memtable. |
| `region_engine.mito.memtable.fork_dictionary_bytes` | String | `1GiB` | Max dictionary bytes.<br/>Only available for `partition_tree` memtable. |
| `logging` | -- | -- | The logging options. |
| `logging.dir` | String | `/tmp/greptimedb/logs` | The directory to store the log files. |
| `logging.level` | String | `None` | The log level. Can be `info`/`debug`/`warn`/`error`. |
| `logging.append_stdout` | Bool | `true` | Whether to append logs to stdout. |
| `logging.tracing_sample_ratio` | -- | -- | The percentage of tracing that will be sampled and exported.<br/>Valid range `[0, 1]`: 1 means all traces are sampled, 0 means no traces are sampled; the default value is 1.<br/>Ratios > 1 are treated as 1 and fractions < 0 are treated as 0. |
| `export_metrics` | -- | -- | The datanode can export its metrics and send them to a Prometheus-compatible service (e.g. `greptimedb` itself) via the remote-write API.<br/>This is only used for `greptimedb` to export its own metrics internally. It's different from Prometheus scraping. |
| `export_metrics.remote_write.url` | String | `""` | The URL to send the metrics to, for example: `http://127.0.0.1:4000/v1/prometheus/write?db=information_schema`. |
This RFC proposes to add a new expression node `MergeScan` to merge results from remote data sources.
This merge operation simply chains all the underlying remote data sources and returns `RecordBatch`es, just like a coalesce operation. Each remote source is a gRPC query to a datanode via the Substrait logical plan interface. The plan is transformed and split out from the original query that arrives at the frontend.
This RFC proposes a storage engine optimization that introduces an inverted index to speed up label selection queries on metrics, targeting tag columns for optimization.
# Introduction
In the current system, the Mito engine has a min-max index on the first primary key column, which significantly improves query performance. However, there are limitations when it comes to other columns, primarily tags. This RFC suggests implementing an inverted index to provide enhanced filtering, bridge these limitations, and improve overall system performance.
# Design Detail
## Inverted Index
The primary aim of the proposed inverted index is to optimize tag columns in the SST Parquet Files within the Mito Engine. The mapping and construction of an inverted index, from Tag Values to Row Groups, enables efficient logical structures that provide faster and more flexible queries.
When scanning SST files, pushed-down filters are applied to the respective tag's inverted index to determine the final row groups to be scanned, further improving the speed and efficiency of data retrieval.
## Index Format
The Inverted Index for each SST file follows the format shown below:
The `footer_payload` is the protobuf encoding of `InvertedIndexFooter`.
The complete format is containerized in [Puffin](https://iceberg.apache.org/puffin-spec/) with the type defined as `greptime-inverted-index-v1`.
## Protobuf Details
The `InvertedIndexFooter` is defined in the following protobuf structure:
```protobuf
message InvertedIndexFooter {
  repeated InvertedIndexMeta metas = 1;
}

message InvertedIndexMeta {
  string name = 1;
  uint64 row_count_in_group = 2;
  uint64 fst_offset = 3;
  uint64 fst_size = 4;
  uint64 null_bitmap_offset = 5;
  uint64 null_bitmap_size = 6;
  InvertedIndexStats stats = 7;
}

message InvertedIndexStats {
  uint64 null_count = 1;
  uint64 distinct_count = 2;
  bytes min_value = 3;
  bytes max_value = 4;
}
```
## Bitmap
Bitmaps are used to represent indices of fixed-size groups. Rows are divided into groups of a fixed size, defined in the `InvertedIndexMeta` as `row_count_in_group`.
For example, when `row_count_in_group` is `4096`, it means each group has `4096` rows. If there are a total of `10000` rows, there will be `3` groups in total. The first two groups will have `4096` rows each, and the last group will have `1808` rows. If the indexed values are found in row `200` and `9000`, they will correspond to groups `0` and `2`, respectively. Therefore, the bitmap should show `0` and `2`.
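To make the arithmetic above concrete, here is a tiny self-contained sketch (illustrative only, not the engine's actual code) that computes the group index of a row:
```rust
// The group index of a row is simply `row_id / row_count_in_group`.
fn group_of(row_id: u64, row_count_in_group: u64) -> u64 {
    row_id / row_count_in_group
}

fn main() {
    let row_count_in_group = 4096u64;
    let total_rows = 10_000u64;
    // ceil(10000 / 4096) = 3 groups; the last group holds 10000 - 2 * 4096 = 1808 rows.
    let group_count = (total_rows + row_count_in_group - 1) / row_count_in_group;
    assert_eq!(group_count, 3);
    // Rows 200 and 9000 fall into groups 0 and 2, so the bitmap sets bits 0 and 2.
    assert_eq!(group_of(200, row_count_in_group), 0);
    assert_eq!(group_of(9000, row_count_in_group), 2);
}
```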
Bitmap is implemented using [BitVec](https://docs.rs/bitvec/latest/bitvec/), selected due to its efficient representation of dense data arrays typical of indices of groups.
## Finite State Transducer (FST)
[FST](https://docs.rs/fst/latest/fst/) is a highly efficient data structure ideal for in-memory indexing. It represents ordered sets or maps where the keys are bytes. The choice of the FST effectively balances the need for performance, space efficiency, and the ability to perform complex analyses such as regular expression matching.
The conventional usage of FST and `u64` values has been adapted to facilitate indirect indexing to row groups. As the row groups are represented as Bitmaps, we utilize the `u64` values split into bitmap's offset (higher 32 bits) and size (lower 32 bits) to represent the location of these Bitmaps.
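The packing described above is simple bit arithmetic. Below is a minimal sketch of it; the function names are illustrative rather than taken from the codebase:
```rust
// The u64 stored in the FST packs the bitmap's offset into the higher 32 bits
// and its size into the lower 32 bits.
fn pack(bitmap_offset: u32, bitmap_size: u32) -> u64 {
    ((bitmap_offset as u64) << 32) | bitmap_size as u64
}

fn unpack(value: u64) -> (u32, u32) {
    ((value >> 32) as u32, value as u32)
}

fn main() {
    let value = pack(1024, 64);
    assert_eq!(unpack(value), (1024, 64));
}
```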
## API Design
Two APIs are designed: `InvertedIndexBuilder` for building indexes and `InvertedIndexSearcher` for querying them.
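The concrete trait definitions are not reproduced here, so the following is only a hedged sketch: the trait names come from the RFC, while the method signatures and helper types are assumptions for illustration.
```rust
// Hedged sketch of the two APIs; not the real GreptimeDB definitions.
use std::collections::BTreeSet;

type Error = Box<dyn std::error::Error>;

/// Builds the inverted index while an SST file is being written.
trait InvertedIndexBuilder {
    /// Record that `value` of the tag column `name` appears in row group `group`.
    fn add(&mut self, name: &str, value: Option<&[u8]>, group: u64) -> Result<(), Error>;
    /// Serialize the FSTs, bitmaps and footer into the index blob.
    fn finish(&mut self) -> Result<(), Error>;
}

/// Queries the inverted index when scanning an SST file.
trait InvertedIndexSearcher {
    /// Apply a pushed-down predicate on the tag column `name` and return the
    /// row groups that may contain matching rows.
    fn search(&mut self, name: &str, predicate: &[u8]) -> Result<BTreeSet<u64>, Error>;
}

fn main() {}
```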
**Only the red nodes will persist state after they have succeeded**; other nodes (excluding the Start and End nodes) won't persist state.
## Steps
**The persistent context:** It's shared across steps and available after recovery. It will only be updated/stored after a red node has succeeded.
Values:
- `region_id`: The target leader region.
- `peer`: The target datanode.
- `close_old_leader`: Indicates whether to close the old leader region.
- `leader_may_unreachable`: Used to support the failover procedure.
**The volatile context:** It's shared across steps and available during execution (including retries). It will be dropped if the procedure runner crashes.
### Select Candidate
The Persistent state: Selected Candidate Region.
### Update Metadata(Down)
**The Persistent context:**
- The (latest/updated) `version` of `TableRouteValue`; it will be used in the `Update Metadata(Up)` step.
### Downgrade Leader
This step sends an instruction via heartbeat and performs:
1. Downgrades the leader region.
2. Retrieves the `last_entry_id` (if available).
If the target leader region is not found:
- Sets `close_old_leader` to true.
- Sets `leader_may_unreachable` to true.
If the target Datanode is unreachable:
- Waits for the region lease to expire.
- Sets `close_old_leader` to true.
- Sets `leader_may_unreachable` to true.
**The Persistent context:**
None
**The Persistent state:**
- `last_entry_id`
Passed to the next step.
### Upgrade Candidate
This step sends an instruction via heartbeat and performs:
1. Replays the WAL to the latest entry (`last_entry_id`).
2. Upgrades the candidate region.
If the target region is not found:
- Rolls back.
- Notifies the failover detector if `leader_may_unreachable` == true.
- Exits procedure.
If the target Datanode is unreachable:
- Rolls back.
- Notifies the failover detector if `leader_may_unreachable` == true.
- Exits procedure.
**The Persistent context:**
None
### Update Metadata(Up)
This step performs:
1. Switches the leader.
2. Removes the old leader (Opt.).
3. Moves the old leader to follower (Opt.).
The current `TableRouteValue` version should equal the `version` stored in the persistent context. Otherwise, it verifies whether the `TableRouteValue` has already been updated.
**The Persistent context:**
None
### Close Old Leader(Opt.)
This step sends a close region instruction via heartbeat.
If the target leader region is not found:
- Ignore.
If the target Datanode is unreachable:
- Ignore.
### Open Candidate(Opt.)
This step sends an open region instruction via heartbeat and waits for conditions to be met (typically, the condition is that the `last_entry_id` of the Candidate Region is very close to that of the Leader Region or the latest).
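As a rough illustration of that wait condition, the check could look like the sketch below; the names and the lag threshold are hypothetical, not the actual implementation.
```rust
// Hypothetical helper: the candidate is considered caught up once its
// last_entry_id is within `allowed_lag` entries of the leader's.
fn candidate_caught_up(candidate_last_entry_id: u64, leader_last_entry_id: u64, allowed_lag: u64) -> bool {
    leader_last_entry_id.saturating_sub(candidate_last_entry_id) <= allowed_lag
}

fn main() {
    // The candidate is only a few entries behind the leader: proceed.
    assert!(candidate_caught_up(990, 1000, 16));
    // The candidate is far behind: keep waiting.
    assert!(!candidate_caught_up(100, 1000, 16));
}
```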
This RFC proposes to enclose the usage of `ColumnId` into the region engine only.
# Motivation
`ColumnId` is an identifier for columns. It's assigned by meta server, stored in `TableInfo` and `RegionMetadata` and used in region engine to distinguish columns.
At present, Frontend, Datanode and Metasrv are all aware of `ColumnId`, but it's only used in the region engine. Thus this RFC proposes removing it from Frontend (where it's mainly used in `TableInfo`) and Metasrv.
# Details
`ColumnId` is used widely on both read and write paths. Removing it from Frontend and Metasrv implies several things:
- A column may have different column ids in different regions.
- A column is identified by its name in all components.
- Column order in the region engine is not restricted, i.e., it doesn't need to be in the same order as the table info.
The first point doesn't matter IMO. This concept no longer exists outside of the region server, and each region is autonomous and independent -- the only guarantee it should hold is that those columns exist. But if we consider region repartition, where an SST file would be re-assigned to different regions, things become a bit more complicated. A possible solution is to store the relation between name and `ColumnId` in the manifest, but that's out of the scope of this RFC. We can likely provide a workaround by introducing an indirection mapping layer across different versions of partitions.
And more importantly, we can still assume columns have the same column ids across regions. We have procedures to maintain consistency between regions, and the region engine should ensure alterations are idempotent. So it is possible that region repartition won't need to consider column ids or other region metadata in the future.
Users write and query columns by their names, not by `ColumnId` or anything else. The second point also means changing the column reference in `ScanRequest` from index to name (see the sketch below). This change can hugely alleviate the misuse of the column index, which has given us many surprises.
And for the last one, column order only matters in the table info. This order is used in user-facing table structure operations, like adding a column, describing a column, or as the default column order of an INSERT clause. None of them is connected with the order in storage.
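To illustrate the second point above (referencing columns by name in `ScanRequest`), the sketch below shows the shape of the change; the struct and field names are hypothetical, not the actual GreptimeDB definitions.
```rust
// Hypothetical before/after shapes of the projection field in a scan request.
struct ScanRequestByIndex {
    /// Projection as column indexes into the region schema (error-prone:
    /// the order must match the table info exactly).
    projection: Option<Vec<usize>>,
}

struct ScanRequestByName {
    /// Projection as column names; the region engine resolves them to its
    /// own internal column ids.
    projection: Option<Vec<String>>,
}

fn main() {
    let _old = ScanRequestByIndex { projection: Some(vec![0, 2]) };
    let _new = ScanRequestByName {
        projection: Some(vec!["host".to_string(), "cpu".to_string()]),
    };
}
```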
# Drawback
Firstly, this is a breaking change. Delivering it requires a full upgrade of the cluster. Secondly, it may introduce some performance regression: for example, we have to pass the full column name in the `ScanRequest` instead of the `ColumnId`. But this influence is very limited, since the column identifier is only used in the region engine.
# Alternatives
There are two alternatives from the perspective of "what can be used as the column identifier":
- The index of the column in the table schema
- The `ColumnId` of that column
The first one is what we are using now. Choosing this way requires keeping the column order in the region engine the same as the table info. This is not hard to achieve, but it's a bit annoying. And things become tricky when there are internal columns or different schemas, like those stored in file formats. This is the initial purpose of this RFC: to decouple the table schema and the region schema.
The second one, on the other hand, requires that the `ColumnId` be identical in all regions and the `TableInfo`. It has the same drawback as the previous alternative: the `TableInfo` and `RegionMetadata` are tied together. Another point is that the `ColumnId` is assigned by the Metasrv, which doesn't need it but has to maintain it. This also limits the functionality of `ColumnId`, by taking away the concrete region engine's ability to assign it.
This RFC proposes a Lightweight Module for executing continuous aggregation queries on a stream of data.
# Motivation
Being able to do continuous aggregation is a very powerful tool. It allows you to do things like:
1. downsample data, e.g. from 1-millisecond to 1-second resolution
2. calculate the average of a stream of data
3. keep a sliding window of data in memory
In order to do those things while maintaining a low memory footprint, you need to be able to manage the data in a smart way. Hence, we only store necessary data in memory, and send/recv data deltas to/from the client.
# Details
## System boundary / What it is and isn't
- GreptimeFlow provides a way to perform continuous aggregation over time-series data.
- It's not a complete stream-processing system. Only a necessary subset of functionalities is provided.
- Flow can process a configured range of fresh data. Data exceeding this range will be dropped directly. Thus it cannot handle datasets with random timestamps.
- Both sliding windows (e.g., the latest 5m from the present) and fixed windows (every 5m from some start time) are supported, and these two are the major target scenarios.
- Flow can handle most aggregate operators within one table (i.e. sum, avg, min, max, and comparison operators). Others (join, trigger, txn, etc.) are not target features.
## Framework
- Greptime Flow is built on top of [Hydroflow](https://github.com/hydro-project/hydroflow).
- We had three choices for the dataflow/streaming framework behind our simple continuous aggregation feature:
1. Build on the timely/differential dataflow crates that [materialize](https://github.com/MaterializeInc/materialize) is based on. This proved too obscure for simple usage, and it is hard to customize memory usage control.
2. Build on a simple dataflow framework that we write from the ground up, like [arroyo](https://www.arroyo.dev/) or [risingwave](https://www.risingwave.dev/) did; for example, the core streaming logic of [arroyo](https://github.com/ArroyoSystems/arroyo/blob/master/arroyo-datastream/src/lib.rs) takes only about 2000 lines of code. However, it means maintaining another layer of dataflow framework, which might seem easy in the beginning but could become too burdensome once we need more features.
3. Build on a simple, lower-level dataflow framework written by someone else, like [hydroflow](https://github.com/hydro-project/hydroflow). This approach combines the best of both worlds: it is easy to comprehend and customize, and the framework offers precisely the features needed to craft uncomplicated single-node dataflow programs while delivering decent performance.
Hence, we choose the third option and use a simple logical plan that's agnostic to the underlying dataflow framework: it only describes what the dataflow graph should do, not how it does it. We build operators in Hydroflow to execute the plan, and the resulting Hydroflow graph is wrapped in an engine that only supports data in/out and a tick event to flush and compute results. This provides a thin middle layer that's easy to maintain and allows switching to another dataflow framework if necessary.
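The following sketch illustrates the intended shape of that thin engine layer; all names are hypothetical and only show the surface area (data in/out plus a tick), not the actual GreptimeFlow API.
```rust
// Placeholder row/error types so the sketch is self-contained.
struct Row(Vec<i64>);
struct EvalError(String);

/// Wraps a built dataflow graph behind a tiny surface.
trait FlowEngine {
    /// Feed a batch of fresh rows into the dataflow.
    fn push_rows(&mut self, rows: Vec<Row>) -> Result<(), EvalError>;
    /// Advance time, flush the dataflow, and return the delta of results to
    /// write back through the frontend.
    fn tick(&mut self, now_millis: i64) -> Result<Vec<Row>, EvalError>;
}

fn main() {}
```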
## Deploy mode and protocol
- Greptime Flow is an independent streaming compute component. It can run either within a standalone node or as a dedicated node at the same level as the frontend in distributed mode.
- It accepts insert requests as Rows, the same format used between the frontend and datanode.
- A new flow job is submitted as a modified SQL query, as Snowflake does, like: `CREATE TASK avg_over_5m WINDOW_SIZE = "5m" AS SELECT avg(value) FROM table WHERE time > now() - 5m GROUP BY time(1m)`. The flow job is then stored in Metasrv.
- It also persists results back to the frontend in the Rows format.
- The query plan uses Substrait as the codec format, the same as GreptimeDB's query engine.
- Greptime Flow needs a WAL for recovery. It's possible to reuse the datanode's.
The workflow is shown in the following diagram:
```mermaid
graph TB
subgraph Flownode["Flownode"]
subgraph Dataflows
df1("Dataflow_1")
df2("Dataflow_2")
end
end
subgraph Frontend["Frontend"]
newLines["Mirror Insert
Create Task From Query
Write result from flow node"]
end
subgraph Datanode["Datanode"]
end
User --> Frontend
Frontend -->|Register Task| Metasrv
Metasrv -->|Read Task Metadata| Frontend
Frontend -->|Create Task| Flownode
Frontend -->|Mirror Insert| Flownode
Flownode -->|Write back| Frontend
Frontend --> Datanode
Datanode --> Frontend
```
## Lifecycle of data
- New data is inserted into the frontend as before. The frontend mirrors insert requests to the flow node if there is a configured flow job.
- Depending on the timestamp of the incoming data, flow will either drop it (outdated data) or process it (fresh data).
- Greptime Flow periodically writes results back to the result table through the frontend.
- Those results are then written into a result table stored in the datanode.
- A small table of intermediate state is kept in memory, which is used to calculate the result.
## Supported operations
- Greptime Flow accepts a configurable "materialize window"; data points exceeding that time window are discarded.
- Data within the "materialize window" is queryable and updatable.
- Greptime Flow can handle partitioning if and only if the input query can be transformed into a fully partitioned plan according to the existing commutative rules. Otherwise the corresponding flow job has to be calculated on a single node.
- Note that Greptime Flow has to see all the data belonging to one partition.
- Deletion and duplicate insertion are not supported at early stage.
## Miscellaneous
- Greptime Flow can translate SQL into its own plan, but only a selected few aggregate functions are supported for now, like min/max/sum/count/avg.
- Greptime Flow's operators are configurable in terms of the size of the materialize window, whether to allow delayed incoming data, etc., so the simplest operators can choose not to tolerate any delay to save memory.
# Future Work
- Support UDFs that can do one-to-one mapping. Preferably, we can reuse the UDF mechanism in GreptimeDB.
- Support the join operator.
- Design syntax to configure operators with different materialize windows and delay tolerances.
- Support a cross-partition merge operator that allows complex query plans that don't necessarily accord with the partitioning rules to communicate between nodes and create the final materialized result.
- Duplicate insertion, which can be reverted easily within the current framework, so supporting it should be straightforward.
- Deletion within the "materialize window"; this requires operators like min/max to store all inputs within the materialize window, which might require further optimization.
A new region partition scheme that runs on multiple dimensions of the key space. The partition rule is defined by a set of simple expressions on the partition key columns.
# Motivation
The current partition rule is from MySQL's [`RANGE Partition`](https://dev.mysql.com/doc/refman/8.0/en/partitioning-range.html), which is based on a single dimension. It is sort of a [Hilbert Curve](https://en.wikipedia.org/wiki/Hilbert_curve) approach that picks several points on the curve to divide the space. It is neither easy to understand how the data gets partitioned nor flexible enough to handle complex partitioning requirements.
Considering future requirements like region repartitioning or autonomous rebalancing, where both workload and partitions may change frequently, this RFC proposes a new region partition scheme that uses a set of simple expressions on the partition key columns to divide the key space.
# Details
## Partition rule
First, we define a simple expression that can be used to define the partition rule. A simple expression is a binary expression on the partition key columns that evaluates to a boolean value. The binary operator is limited to comparison operators only, like `=`, `!=`, `>`, `>=`, `<`, `<=`, and the operands are limited to either a literal value or a partition column.
Example of valid simple expressions are $`col_A = 10`$, $`col_A \gt 10 \& col_B \gt 20`$ or $`col_A \ne 10`$.
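As a rough illustration, such a simple expression could be modeled as the data structure sketched below; the type and variant names are illustrative, not the actual implementation.
```rust
// Illustrative model of a "simple expression" on partition key columns.
#[allow(dead_code)]
enum Operand {
    /// A partition key column, referenced by name.
    Column(String),
    /// A literal value (other literal types elided for brevity).
    Literal(i64),
}

#[allow(dead_code)]
enum CmpOp {
    Eq,
    NotEq,
    Gt,
    GtEq,
    Lt,
    LtEq,
}

/// One comparison, e.g. `a < 10`. A partition rule is a list of such
/// expressions, optionally combined with AND.
#[allow(dead_code)]
struct PartitionExpr {
    lhs: Operand,
    op: CmpOp,
    rhs: Operand,
}

fn main() {
    // `a < 10`, the first region in the CREATE TABLE example below.
    let _expr = PartitionExpr {
        lhs: Operand::Column("a".to_string()),
        op: CmpOp::Lt,
        rhs: Operand::Literal(10),
    };
}
```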
Those expressions can be used as predicates to divide the key space into different regions. The following example has two partition columns, `Col A` and `Col B`, and four partitioned regions.
An advantage of this scheme is that it is easy to understand how the data gets partitioned. The above example can be visualized in a 2D space (two partition columns are involved in the example).

Each expression draws a line in the 2D space. Managing data partitioning becomes a matter of drawing lines in the key space.
To make it easy to use, there is a "default region" which catches all the data that doesn't match any of the previous expressions. The default region exists by default and does not need to be specified. It is also possible to remove this default region if the DB finds it unnecessary.
## SQL interface
The SQL interface is responsible for two parts: specifying the partition columns and the partition rule. Though we are targeting an autonomous system, it's still allowed to give some bootstrap rules or hints when creating a table.
Partition column is specified by `PARTITION ON COLUMNS` sub-clause in `CREATE TABLE`:
```sql
CREATE TABLE t (...)
PARTITION ON COLUMNS (...) ();
```
The two following pairs of parentheses are for the partition columns and the partition rule, respectively.
Columns provided here are only used as an allow-list of how the partition rule can be defined, which means (a) the order of the columns doesn't matter, and (b) the columns provided here are not necessarily used in the partition rule.
The partition rule part is a list of comma-separated simple expressions. The expressions here do not correspond one-to-one to regions, as they might be changed by the system to fit various workloads.
A full example of `CREATE TABLE` with partition rule is:
```sql
CREATE TABLE IF NOT EXISTS demo (
    a STRING,
    b STRING,
    c STRING,
    d STRING,
    ts TIMESTAMP,
    memory DOUBLE,
    TIME INDEX (ts),
    PRIMARY KEY (a, b, c, d)
)
PARTITION ON COLUMNS (c, b, a) (
    a < 10,
    10 <= a AND a < 20,
    20 <= a AND b < 100,
    20 <= a AND b >= 100
)
```
## Combine with storage
Examining columns separately suits our columnar storage very well in two aspects.
1. The simple expressions can be pushed down to the storage and file formats, and are likely to hit existing indexes, which makes pruning very efficient.
2. Columns in columnar storage are not tightly coupled like in traditional row storage, which means we can easily add or remove columns from the partition rule without much impact (like a global reshuffle) on data.
The data file itself can be "projected" onto the key space as a polyhedron, and it is guaranteed that each face is parallel to some coordinate plane (in a 2D scenario, this means all files can be projected to rectangles). Thus partitioning or repartitioning only needs to consider the related columns.

An additional limitation is that, considering how the index works and how we organize primary keys at present, the partition columns are limited to a subset of the primary key columns for better performance.
This style guide is intended to help contributors to GreptimeDB write code that is consistent with the rest of the codebase. It is a living document and will be updated as the codebase evolves.
It's mainly a complement to the [Rust Style Guide](https://pingcap.github.io/style-guide/rust/).
## Table of Contents
- Formatting
- Modules
- Comments
## Formatting
- Place all `mod` declarations before any `use`.
- Use `unimplemented!()` instead of `todo!()` for things that aren't likely to be implemented.
- Add an empty line before and after declaration blocks.
- Place comments before attributes (`#[]`) and derives (`#[derive]`).
## Modules
- Use a file with the same name as the module, instead of `mod.rs`, to define a module. E.g.:
```
.
├── cache
│ ├── cache_size.rs
│ └── write_cache.rs
└── cache.rs
```
## Comments
- Add comments for public functions and structs.
- Prefer doc comments (`///`) over normal comments (`//`) for structs, fields, functions, etc.
- Add links (`[]`) to structs, methods, or any other references, and make sure the links work.
## Error handling
- Define a custom error type for the module if needed.
- Prefer `with_context()` over `context()` when allocation is needed to construct an error.
- Use `error!()` or `warn!()` macros in the `common_telemetry` crate to log errors. E.g.:
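For example, a minimal sketch of this convention, assuming the `error!`/`warn!` macros re-exported by `common_telemetry` accept format-style arguments like the standard tracing macros (the exact syntax may differ):
```rust
use common_telemetry::{error, warn};

// Hypothetical fallible operation used only for illustration.
fn flush_region(_region_id: u64) -> Result<(), String> {
    Err("disk full".to_string())
}

fn handle_flush(region_id: u64) {
    if let Err(e) = flush_region(region_id) {
        // Log the failure and let the caller decide whether it is fatal.
        error!("Failed to flush region {}: {}", region_id, e);
        warn!("Region {} will be retried on the next tick", region_id);
    }
}
```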
Status note: we are still working on this dashboard config. It's expected to change frequently in the coming days. Please feel free to submit your feedback and/or contributions to this dashboard 🤗
# How to use
Open the Grafana dashboards page, choose `New` -> `Import`, and upload the `greptimedb.json` file.
// impl DataSource for InformationTableDataSource
let stream = self
    .table
    .to_stream(request)
    .map_err(BoxedError::new)
    .context(TablesRecordBatchSnafu)
    .map_err(BoxedError::new)?
    // ...
    None => batch,
});
let stream = RecordBatchStreamWrapper {
    schema: projected_schema,
    stream: Box::pin(stream),
    output_ordering: None,
    metrics: Default::default(),
};
Ok(Box::pin(stream))
}
}