* feat: add initial implementation for status endpoint
* feat(status_endpoint): add more data to response
* feat(status_endpoint): use build data env vars
* feat(status_endpoint): add simple test
* fix(status_endpoint): adjust the toml indentation
* fix: set max_files_in_l0 in unit tests to avoid compaction
* refactor: pass whole EngineConfig
* fix: comment out unstable sqlness test
* revert commenting out the sqlness test
* add some debug log
* fix: use lazy parquet reader in MitoTable::scan_to_stream to avoid IO in plan stage
* fix: unit tests
* fix: order-by optimization
* add some tests
* fix: move metric names to metrics.rs
* fix: some cr comments
* refactor: Remove MySQL related options from Datanode
remove mysql_addr and mysql_runtime_size in datanode.rs, remove command line argument mysql_addr in cmd/src/datanode.rs
#1739
* feat: remove --mysql-addr from command line
in pre-commit, sqlness cannot find --mysql-addr because we removed it
issue #1739
* refactor: remove --mysql-addr from command line
in pre-commit, sqlness cannot find --mysql-addr because we removed it
issue #1739
* feat: Add column supports at first or after the existing columns
* Update src/common/query/Cargo.toml
---------
Co-authored-by: dennis zhuang <killme2008@gmail.com>
* feat: add SqlDialect to query context
* feat: use session in postgres handlers
* chore: refactor sql dialect
* feat: use different dialects for different sql protocols
* feat: adds GreptimeDbDialect
* refactor: replace GenericDialect with GreptimeDbDialect
* feat: save user info to session
* fix: compile error
* fix: test
* feat: use a distributed lock to guard against concurrent updates of table metadata in the region failover procedure
* fix: resolve PR comments
* fix: resolve PR comments
* feat(servers): Export process metrics
* chore: update metrics related deps to get the process-metrics printed
The latest process-metrics crate depends on metrics 0.21, while we use metrics
0.20. This causes the process-metrics crate to not record metrics when the
metrics macros are used.
When building the image with docker build, an error reports that pip is missing. Install python3-pip in the Dockerfile.
Fixes: #1643
Signed-off-by: yaoyinnan <yaoyinnan@foxmail.com>
* fix: fix close region issue
* chore: apply suggestion from CR
* chore: apply suggestion from CR
* chore: apply suggestion from CR
* chore: apply suggestion from CR
* refactor: remove close method from Region trait
* chore: remove PartialEq from CloseTableResult
* feat(storage): Add AllocTracker
* feat(storage): flush request wip
* feat(storage): support global write buffer size
* fix(storage): Test and fix size based strategy
* test(storage): Test AllocTracker
* test(storage): Test pick_by_write_buffer_full
* docs: Add flush config example
* test(storage): Test schedule_engine_flush
* feat(storage): Add metrics for write buffer size
* chore(flush): Add log when triggering flush by global buffer
* chore(storage): track allocation in update_stats
* feat: add timezone info to query context
* feat: parse mysql compatible time zone string
* feat: add method to timestamp for rendering timezone aware string
* feat: use timezone from session for time string rendering
* refactor: use QueryContextRef
* feat: implement session/timezone variable read/write
* style: resolve toml format
* test: update tests
* Apply suggestions from code review
Co-authored-by: dennis zhuang <killme2008@gmail.com>
* Update src/session/src/context.rs
Co-authored-by: dennis zhuang <killme2008@gmail.com>
* refactor: address review issues
---------
Co-authored-by: dennis zhuang <killme2008@gmail.com>
* fix(table-procedure): on_register_catalog should use open_table
* test: Test recover RegisterCatalog state
* test: Fix subprocedure does not execute in test
* feat(mito): adjust procedure log level
* refactor: rename execute_parent_procedure
execute_parent_procedure -> execute_until_suspended_or_done
* feat: add delete WAL in drop_region
* chore: fix typo err.
* feat: mark all SSTs deleted and remove the region from StorageEngine's region map.
* test: add test_drop_region for StorageEngine.
* chore: make clippy happy
* fix: fix conflict
* chore: CR.
* chore: CR
* chore: fix clippy
* fix: temp file life time
* fix: remove region number validation
* Update src/mito/src/engine.rs
Co-authored-by: dennis zhuang <killme2008@gmail.com>
---------
Co-authored-by: dennis zhuang <killme2008@gmail.com>
* feat: support frontend heartbeat
* fix: typo "reponse" -> "response"
* add ut
* enable start heartbeat task
* chore: frontend id is specified by metasrv, not in the frontend startup parameter
* fix typo
* self-cr
* cr
* cr
* cr
* remove unnecessary headers
* use the member id in the header as the node id
* feat: Add FlushPicker
* feat(storage): Add close to StorageEngine
* style(storage): fix clippy
* feat(storage): Close regions in StorageEngine::close
* chore(storage): Clear requests on scheduler stop
* test(storage): Test flush picker
* feat(storage): Add metrics for auto flush
* feat(storage): Add flush reason and record it in metrics
* feat: Expose flush config
docs(config): Update config example
* refactor(storage): Run auto flush task in FlushScheduler
* refactor(storage): Add FlushItem trait to make FlushPicker easy to test
* feat: add storage engine region count gauge
* test: remove catalog metrics because we can't get a correct number
* feat: add metrics for log store write and compaction
* fix: address review issues
* test: simplify countdownlatch
* feat: impl Drop for LocalScheduler
* feat(storage): Impl FlushRequest and FlushHandler
* feat(storage): Use scheduler to handle flush job
* chore(storage): remove unused code
* feat(storage): Use new type pattern for RegionMap
* feat(storage): Remove on_success callback
* feat(storage): Address CR comments and add some metrics to flush
* feat(common-procedure): pub(crate) use proc_path
* feat(common-procedure): Implement delete_procedure
* feat(common-procedure): Clean procedure after it is finished
* chore(common-procedure): put path_string in front of try_stream
* test(common-procedure): Test cleaning up procedures
* feat(common-procedure): Clean procedure states in recover()
* feat(common-procedure): Use VecDeque for finished procedures
* fix: incorrect show create table output
* feat: change CreateTable's Display if table is external
* feat: change CreateTable's Display if table is external
* refactor: refactor copy from executor
* feat: support to copy from CSV and JSON format files
* feat: support to copy table to the CSV and JSON format file
* test: add tests copy from/to
* chore: apply suggestions from CR
* refactor: id first in pusher_key
* feat: is_acceptable for multi roles
* feat: mailbox
* fix: channel for mailbox
* feat: impl mailbox via heartbeat
* chore: add unit test for mailbox
* chore: by cr
* chore: typo
* chore: refactor the mailbox API
* chore: by CR
* chore: set the timeout check interval to 10ms
* chore: add response header
* feat: add support for information_schema.columns
* feat: remove information_schema from its view
* Update src/catalog/src/information_schema.rs
Co-authored-by: LFC <bayinamine@gmail.com>
* fix: error on table data type
* test: correct sqlness test for information schema
* test: add information_schema.columns sqlness tests
---------
Co-authored-by: LFC <bayinamine@gmail.com>
* test(storage): use assert_eq to check scan result
* feat(storage): Add more info to manifest log
* feat: Avoid error log when unable to delete
* fix: The manifest gc task should delete files <= last_version
* feat(storage): Don't log if the error kind is not found
* feat: Add keep_last_checkpoint option
* feat: improve and distinguish different errors for IllegalInsertData
* feat: change error code for UnexpectedValuesLength and ColumnAlreadyExists
* chore: improve readability of error message
* feat: add metrics for various interfaces
* feat: add db label for protocols
* feat: add postgres protocol metrics
* feat: add metrics for grpcs apis
* feat: add auth failure counter for mysql/pg
* fix: add db label to grpc prometheus interface
* Apply suggestions from code review
Co-authored-by: dennis zhuang <killme2008@gmail.com>
* feat: add error code for auth failure counter
* fix: use schema as dbname when catalog is default
---------
Co-authored-by: dennis zhuang <killme2008@gmail.com>
* fix(mito): Add metrics to mito DDL procedure
* feat(mito): Use procedure's implementation to create table
* feat(mito): Use procedure's implementation to alter table
* feat(mito): Use procedure's implementation to drop table
* style(mito): Fix clippy
* test(mito): Fix tests
* feat(mito): Add TableCreator
* feat(mito): update alter table procedure
* fix(mito): alter procedure create alter op first
* feat(mito): Combine alter table code
* fix(mito): Fix deadlock
* feat(mito): Simplify drop table procedure
* build: Download assets to cargo output dir
Also remove the output from the build script and only print the output
on failure
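A minimal build-script sketch of the "print output only on failure" pattern described above; the command and paths are placeholders, not the actual src/servers/build.rs logic.

```rust
// build.rs (illustrative sketch, not the actual src/servers/build.rs)
use std::process::Command;

fn main() {
    // OUT_DIR is provided by cargo for build scripts.
    let out_dir = std::env::var("OUT_DIR").unwrap_or_else(|_| ".".to_string());

    // Hypothetical asset-fetching command; replace with the real downloader.
    let output = Command::new("sh")
        .arg("-c")
        .arg(format!("echo fetching assets into {out_dir}"))
        .output()
        .expect("failed to run asset download command");

    // Stay quiet on success; dump the captured output only when the command fails.
    if !output.status.success() {
        eprintln!("stdout:\n{}", String::from_utf8_lossy(&output.stdout));
        eprintln!("stderr:\n{}", String::from_utf8_lossy(&output.stderr));
        panic!("asset download failed with status {}", output.status);
    }
}
```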
* chore: Update src/servers/build.rs
Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
* build: replace pushd by cd
---------
Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
* feat: implement load_options.
* refactor: build by ConfigOptions.
* refactor: init_global_logging by LoggingOptions.
* chore: make clippy happy.
* refactor: use TopLevelOptions push top level options to subcommand.
* test: test TopLevelOptions.
* refactor: push Options in Box.
* refactor: push Options in Box.
* refactor: use let-else and Options.
* feat: Remove create_mock_sql_handler()
create_to_request() and alter_to_request() don't need `&self`, so
we don't need to mock the sql handler to test them
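A hedged sketch of the idea: once the conversion is an associated function with no `&self`, tests can call it directly instead of constructing a mock handler. The types and signatures below are illustrative assumptions, not the real SqlHandler API.

```rust
// Hypothetical shape of the conversion: no `&self`, so no mock handler is needed.
struct CreateTableRequest {
    table_name: String,
}

struct SqlHandler;

impl SqlHandler {
    // Associated function: takes only the parsed statement data.
    fn create_to_request(table_name: &str) -> CreateTableRequest {
        CreateTableRequest {
            table_name: table_name.to_string(),
        }
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_create_to_request_without_mock() {
        // Called directly on the type; no SqlHandler instance is constructed.
        let req = SqlHandler::create_to_request("demo");
        assert_eq!(req.table_name, "demo");
    }
}
```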
* feat: Enable procedure manager by default
* docs: Update config example
* test: Enable procedure framework in all tests
* refactor(datanode): rename methods using procedure
* test(catalog): Fix temp dir drops before test finishes
* tests: Enable procedure framework in sqlness
* test: Fix sqlness standalone rename test
* fix: Drop procedure allows table not in engine
* test: Change rename table test
* fix: add options to table meta when creating table by procedure
* test: adjust error message in schema test case
* test: Fix test_sql_api error message
* fix: frontend opts should respect the http addr in the config file when no command line option is given
* refactor: command line options should be Option<bool>
* fix: ci
* feat: support creating the physical plan for JSON and CSV files
* chore: apply suggestions from CR
* chore: apply suggestions from CR
* refactor(file-table-engine): use datasource Format instead
* refactor: Add table_ref() to requests as their methods
* feat: Add CreateImmutableFileTable
* feat: Add DropImmutableFileTable
* feat: Implement TableEngineProcedure for ImmutableFileTableEngine
* feat: Add common-procedure-test crate
* refactor: mito engine use common-procedure-test to test procedures
* test: Add test for create and drop table
* chore: Address review comments
* feat: Add drop table procedure
* feat: support dropping table by procedure on datanode
* test: Add test for DropTableProcedure
* test: Test drop table by procedure
* chore: update comments
* fix: Make on_remove_from_catalog idempotent
* feat: add table id and engine to information_schema.TABLES
* Update src/catalog/src/information_schema/tables.rs
Co-authored-by: Yingwen <realevenyag@gmail.com>
* chore: change table_engine to engine
* test: update sqlness for information schema
* test: update information_schema test in frontend::tests::instance_test.rs
* fix: github action sqlness information_schema test fail
* test: ignore table_id in information_schema
* test: support distribute and standalone have different output
---------
Co-authored-by: Yingwen <realevenyag@gmail.com>
* Add the cache hit/miss counter
* Verify the cache metrics are included
* Resolve comments
* Rename the error kind label name to be consistent with other metrics
* Rename the object store metric names
* Avoid using glob imports
* Format the code
* chore: Update src/object-store/src/metrics.rs mod doc
---------
Co-authored-by: Yingwen <realevenyag@gmail.com>
* chore: add some metrics for grpc client
* chore: add grpc prefix and change metrics-exporter-prometheus to add a global prefix
---------
Co-authored-by: paomian <qtang@greptime.com>
* ci: set whether it is the latest release by using 'ncipollo/release-action'
* ci: modify greptimedb install script to use the latest nightly version binary
* feat: implement predict_linear function in promql
* feat: initialize predict_linear's planner
* fix(bug): fix a bug in linear regression and add some unit tests for linear regression
* chore: format code
* feat: deal with NULL value in linear_regression
* feat: add test for all value is None
* feat(promql): add holt_winters initial implementation
* feat(promql): improve docs for holt_winters
* feat(promql): adjust holt_winters implementation according to code review
* feat(promql): add holt_winters test from prometheus promql function test suite
* feat(promql): add holt_winters more tests from prometheus promql function test suite
* feat(promql): fix styling issue
Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
---------
Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
* fix: disallow creating a column whose name is a keyword unless it is quoted
* fix: tests
* Update src/sql/src/parsers/create_parser.rs
Co-authored-by: Ning Sun <classicning@gmail.com>
* fix: tests
---------
Co-authored-by: Ning Sun <classicning@gmail.com>
* feat(compaction_time_window): initial changes for compaction_time_window field support
* feat(compaction_time_window): move PickerContext creation
* feat(compaction_time_window): update region descriptor, fix formatting
* feat(compaction_time_window): add minor enhancements
* feat(compaction_time_window): fix failing test
* feat(compaction_time_window): return an error instead of silently skipping an invalid user-provided compaction_time_window
* feat(compaction_time_window): add TODO reminder
* wip: use
* rebase develop
* chore: fix typos
* feat: replace export parquet writer with buffered writer
* fix: some cr comments
* feat: add sst_write_buffer_size config item to configure how many bytes to buffer before flushing to underlying storage
* chore: rebase onto develop
* chore: add http metrics server to datanode when greptime starts in distributed mode
* chore: add some docs and license
* chore: change metrics_addr to resolve address already in use error
* chore add metrics for meta service
* chore: replace metrics exporter http server from hyper to axum
* chore: format
* fix: datanode mode branching error
* fix: sqlness test address already in use and start metrics in default config
* chore: change metrics location
* chore: use builder pattern to build HttpServer
* chore: remove useless debug_assert macro in httpserver builder
* chore: resolve conflicting build error
* chore: format code
* feat: wip
* fix: Fix CreateMitoTable::table_schema not initialized from json
* feat: Implement AlterMitoTable procedure
* test: Add test for alter procedure
* feat: Register alter procedure
* fix: Recover procedures after catalog manager is started
* feat: Simplify usage of table schema in create table procedure
* test: Add rename test
* test: Add drop columns test
* test: Add compaction test
* test: Test read during compaction
* test: Add s3 object store to test
* test: only run compact test
* feat: Hold file handle in chunk stream
* test: check files still exist after compact
* feat: Revert changes to develop.yaml
* test: Simplify MockPurgeHandler
* feat: add dbname and health check for grpc api
* refactor: move health check to dedicated service
* chore: switch to merged proto rev
* feat: implement healthcheck on server-side
* fix: returns None if parquet file does not contain any rows
* fix: skip empty parquet file
* chore: add doc
* rebase develop
* fix: use flatten instead of filter_map with identity
* feat(to_unixtime): add initial implementation
* feat(to_unixtime): use Timestamp for conversion
* feat(to_unixtime): implement conversion to Result<VectorRef>
* feat(to_unixtime): make unit test pass
* feat(to_unixtime): preserve None for invalid timestamps
* feat(to_unixtime): address code review suggestions
* feat(to_unixtime): add an sqlness test
* feat(to_unixtime): adjust the assertion for the sqlness test
* Update tests/cases/standalone/common/select/dummy.sql
---------
Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
* ci: install python3 and python3-dev in CI Dockerfile
* ci: release the standalone binaries with pyo3 support for multiple platforms
* refactor: install pip and pyarrow
* refactor: specify the python version
* fix: use correct env var
* fix: move COPY up so rustup knows it's nightly
* fix: add `pyo3_backend` in GHA yml
* chore: name for `TODO`
* temp: do not set `pyo3_backend` before finding DSO
* fix: release linux with pyo3_backend
* feat: implement promql query on grpc
* test: resolve test errors
* test: add tests for promql grpc api
* refactor: align prom object name with proto
* chore: switch proto revision to main
* fix: make pyo3 optional again
* Update src/script/Cargo.toml
Co-authored-by: dennis zhuang <killme2008@gmail.com>
---------
Co-authored-by: dennis zhuang <killme2008@gmail.com>
* feat(SST): use a newtype named FileId for FileMeta
* chore: rename some functions
* fix: compatible for previous FileMeta format
* fix: alias for file_id when getting deserialized
* feat: mysql prepare by replacing ? in sql with params
* chore: mysql prepared statement supports time param
* chore: prepare test more types
* chore: add TODO
* feat: add take index method for VectorOp
* chore: make clippy happy
* chore: make clippy happy
* chore: improve the code
* chore: improve the code
* chore: add take null test
* chore: fix clippy
* feat: track disk usage of regions
Signed-off-by: Zheming Li <nkdudu@126.com>
* calculate disk usage when called
* add default on file meta
---------
Signed-off-by: Zheming Li <nkdudu@126.com>
* refactor: change from urlencoded to regex
* refactor: change from urlencoded to regex
* chore: add unit test
* chore: update comment
* chore: remove local benchmark test
* chore: minor fix
* chore: remove unused dep
* docs(contributingmd): add run tests commands
* docs(contributingmd): add link to nextest website
Co-authored-by: dennis zhuang <killme2008@gmail.com>
---------
Co-authored-by: dennis zhuang <killme2008@gmail.com>
* fix: apply ttl and write_buffer_size options when a table is created via procedure
* fix: address code review suggestion
* fix: use borrowing of table_options correctly
* docs: Add comments to standalone config example
* docs: Add comments to datanode config example
* docs: Add comments to frontend config example
* docs: Add comments to meta-srv config example
* docs: Use "GB" instead of "GiB"
* docs: Add link to the selector doc
* docs: Fix grammar
* docs: Change comment position
* refactor(procedure): Store error in ProcedureState
* test: Mock instance with procedure enabled
* feat: Add wait method to wait for procedure
* test(datanode): Test create table by procedure
* chore: Fix clippy
* fix: using schema instead of full database
* fix: using schema instead of full database
* fix: using schema instead of full database
* chore: add debug log
* chore: remove debug log
* chore: remove debug log
* chore: fix cr
* fix: Serialize FrontendOptions to toml
* fix: Serialize DatanodeOptions to toml
* fix: Serialize StandaloneOptions to toml
See https://users.rust-lang.org/t/why-toml-to-string-get-error-valueaftertable/85903/2
* chore!: Rename MetaClientOpts to MetaClientOptions
BREAKING CHANGE: Change the meta_client_opts in the config file to
meta_client_options
* chore: get header in grpc & temp save
* chore: change authscheme to include data str
* chore: add auth to grpc flight handler
* chore: add unit test & hold for now since grpc api doesn't accept req input
* chore: minor change
* chore: minor change
* chore: add flight context to database interface
* chore: add test
* chore: update proto version & fix cr issue
* chore: add test
* chore: minor update
* feat: catalog list
* feat: catalog list
* feat:api
* feat: leader info
* feat: use constant
* fix: ci
* feat: query heartbeat by ip
* ut: add test
* fix: cr
* fix: cr
* fix: cr
* refactor: Use watch channel to store ProcedureState
* feat: Add a watcher to wait for state change
* test: test watcher on procedure failure
* feat: Only clear message cache on success
* feat: submit returns Watcher
* feat: change table options from string map to a struct, add ttl and write_buffer_size
* fix: also pass table options to table meta
* feat: pass table options when opening/creating regions
* fix: CR comments
* feat: wip
* feat: Implement procedure to create mito table
* feat: Add create_table_procedure to TableEngine
* feat: Impl dump and lock for CreateMitoTable
* feat: Impl CreateMitoTable::execute and register it to manager
* feat(common-procedure): pub local mod
* feat: Add simple test for MitoCreateTable
* style: Fix clippy
* refactor: Move create_table_procedure to a new trait TableEngineProcedure
* initial impl
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
* minor (useless) refactor
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
* retrieve metric name
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
* add time index column to group by columns
filter out NaN in normalize
remove NULL in instant manipulator
accept form data as HTTP params
correct API URL
accept second literal as step param
* happy clippy
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
* update test result
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
---------
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
* refactor: make schedule request return value generic
* feat: add handler trait
* wip
* feat: use task handler
* fix: unit test
* refactor: separate scheduler mod
* chore: rename
* chore: Request uses associated type
* refactor: use associated type
* refactor: use associated types to reduce generic parameters
* chore: further remove generic types
* chore: further remove a generic parameter
* feat: make args in coprocessor optional
* feat: supports kwargs for coprocessor as params passed by the users
* feat: supports params for /run-script
* fix: we should rewrite the coprocessor by removing kwargs
* fix: remove println
* fix: compile error after rebasing
* fix: improve http_handler_test
* test: http scripts api with user params
* refactor: tweak all to_owned
* feat: Add ContextProvider to Context
So procedures can query states of other procedures via the
ContextProvider and they don't need to hold a ProcedureManagerRef
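A rough sketch of that design, with assumed names and signatures: the Context passed to each procedure carries a ContextProvider, so a parent can check a subprocedure's state without holding a ProcedureManagerRef.

```rust
use std::collections::HashMap;
use std::sync::Arc;

type ProcedureId = u64;

#[derive(Clone, Debug, PartialEq)]
enum ProcedureState {
    Running,
    Done,
}

// Assumed trait: lets a procedure query other procedures' states.
trait ContextProvider: Send + Sync {
    fn procedure_state(&self, id: ProcedureId) -> Option<ProcedureState>;
}

// The context handed to each procedure holds the provider, not the manager.
struct Context {
    provider: Arc<dyn ContextProvider>,
}

struct MapProvider(HashMap<ProcedureId, ProcedureState>);

impl ContextProvider for MapProvider {
    fn procedure_state(&self, id: ProcedureId) -> Option<ProcedureState> {
        self.0.get(&id).cloned()
    }
}

fn main() {
    let provider: Arc<dyn ContextProvider> =
        Arc::new(MapProvider(HashMap::from([(1, ProcedureState::Done)])));
    let ctx = Context { provider };
    // A parent procedure can now check whether subprocedure 1 has finished.
    assert_eq!(ctx.provider.procedure_state(1), Some(ProcedureState::Done));
}
```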
* feat: Procedure supports acquiring multiple lock keys
* test: Use multi-locks in test
* feat: Add keys_to_lock/unlock
* fix: correct date/time type format for postgresql
* fix: tests for timestamp
* refactor: use Utc datetime for timestamp::to_chrono_datetime
* Update src/servers/Cargo.toml
Co-authored-by: LFC <bayinamine@gmail.com>
---------
Co-authored-by: LFC <bayinamine@gmail.com>
* feat: trigger compaction on flush
* chore: rebase develop
* feat: add config item max_file_in_level0 and remove compaction_after_flush
* fix: cr comments
* chore: add unit test to cover Timestamp::new_inclusive
* fix: workaround to fix future is not Sync
* fix: future is not sync
* fix: some cr comments
* add DistLock trait and an implementation based on etcd
wip
impl lock grpc service for meta-srv
reuse the etcd client instead of repeatedly creating etcd client
add some docs and comments
add some comment
meta client supports distributed lock
fix: dead lock
self-cr
* cr
* rename "expire" -> "expire_secs"
* feat: Runner executes procedure
* feat: Add rollback key type to ParsedKey
* feat: Write rollback key when procedure is unable to execute
* feat: Use loaded step to re-submit subprocedure
* feat: Track subprocedures in ProcedureMeta
* feat: Clean message cache after the root procedure is done
* feat: Runner returns execution result
* fix: Fix tests
* test: Test Runner
* test: Test procedures_in_tree
* chore: Refine test and comments
* feat: Remove support of lock inheritance
A deadlock happens if a subprocedure acquires the same lock key as
its parent.
The main concern is: if the subprocedure directly inherits its parent's
lock, how should we behave when multiple subprocedures acquire
this same lock? Each procedure may assume it has unique access to the
same object but it actually shares the resource with others.
Now subprocedures need to use different keys to lock objects, which is
reasonable. For example:
- A parent procedure wants to create a table so it locks the table with
a key like `catalog.schema.table`
- Subprocedures create regions for the table so they lock the regions
with keys `catalog.schema.table.region-0 ~ catalog.schema.table.region-n`
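A toy illustration of the key scheme above (the key format is an assumption): the parent and its subprocedures lock distinct keys, so a child never blocks on the key its parent already holds.

```rust
// Toy illustration of the lock-key scheme described above; not the real API.
fn table_lock_key(catalog: &str, schema: &str, table: &str) -> String {
    format!("{catalog}.{schema}.{table}")
}

fn region_lock_key(catalog: &str, schema: &str, table: &str, region: u32) -> String {
    format!("{catalog}.{schema}.{table}.region-{region}")
}

fn main() {
    // Parent procedure locks the table itself.
    let parent = table_lock_key("greptime", "public", "metrics");
    // Subprocedures lock their own regions instead of inheriting the parent's key,
    // so no subprocedure can deadlock on the key its parent already holds.
    let children: Vec<String> = (0..3)
        .map(|r| region_lock_key("greptime", "public", "metrics", r))
        .collect();
    assert!(children.iter().all(|k| k != &parent));
    println!("{parent}\n{children:?}");
}
```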
* style: Fix clippy
* feat: insert_procedure returns false on duplicate procedure
Also rename this method to try_insert_procedure
* chore: Address CR comments
* feat: implement "drop table" in distributed mode (both in SQL and gRPC)
refactor: create distributed table
some details:
- set table global value in Meta, as well as table routes value. Datanode only sets the table regional value
- complete instance SQL tests both in standalone and distributed mode
* fix: rebase develop
* fix: resolve PR comments
* feat: enable caching when using object store
* feat: support file cache for object store
* feat: maintaining the cached files with lru
* fix: improve the code
* empty commit
* improve the code
* wip: fix compile errors
* chore: move splitter to partition crate
* fix: remove useless variants in frontend errors
* chore: move more partition related code to partition manager
* fix: license header
* wip: move WriteSplitter to PartitionRuleManager
* fix: clippy warnings
* chore: remove useless error variant and format toml
* fix: cr comments
* chore: resolve conflicts
* chore: rebase develop
* fix: cr comments
* feat: support multi regions on datanode
* chore: rebase onto develop
* chore: rebase develop
* chore: rebase develop
* wip
* fix: compile errors
* feat: multi region
* fix: CR comments
* feat: allow getting stats of existing regions without actually opening them
* fix: use table meta in manifest to recover region info
* docs: Add procedure framework RFC
* docs: Add dump, rollback and locking to procedure framework
* docs: Change ProcedureBuilder to ProcedureLoader
* docs: Add sub-procedures section
* docs: Add a link to explain idempotent
* docs: Add link to the tracking issue
* docs: Fix ProcedureLoader type alias
* docs: Update procedure API
* docs: Address CR comments
* docs: Update path and make the docs more clear
* chore: update opensrv-mysql to main
* refactor: change mysql server struct
* feat: add option to reject no database mysql connection request
* chore: remove unused condition
* chore: rebase develop
* chore: make reject_no_database optional
* 1. Reimplement Eq for Timestamp
2. Add and/or for GenericRange
* feat: extract time range from filters
* feat: select sst files according to time range
* fix: clippy
* fix: empty value in range
* fix: some cr comments
* fix: return optional timestamp range
* fix: cr comments
* fix: wrong error info
* add derive hash for StatKey
* add an attrs field in Context
* add load_based selector
* add license
* make Nodestat module public
* add meta startup config item about selector
* cr: remove attrs, add concrete type in context
* cr: change region_number type to Option<u64>
* cr: add comment in example.toml
* cr
* fix: parsing time index column option
* test: adds more cases for creating table
* chore: by CR comments
* feat: validate time index constraint in parser
* chore: improve error msg
* feat: support renaming table in the catalog manager
* feat: implement rename table for local catalog manager
* chore: fmt code
* fix: update system catalog when renaming table in local catalog manager
* chore: add instance test for rename table
* chore: fix frontend test
* chore: fix comment
* chore: fix rename table test
* fix: renaming a table with an existing name
* fix: improve the system catalog's renaming process
* chore: improve the code
* chore: improve the comment
Co-authored-by: Yingwen <realevenyag@gmail.com>
* chore: improve the code
* chore: fix tests
* chore: fix instance_test
* chore: improve the code
Co-authored-by: Yingwen <realevenyag@gmail.com>
If the table has a non-null column, we need to use the default value instead
of null to fill the value columns in the record batch for deletion.
Otherwise, we can't create the record batch, since the schema check
doesn't allow null in the non-null column.
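A simplified sketch of that filling rule, using stand-in types rather than the real WriteBatch/RecordBatch: nullable value columns are padded with null, non-null ones with their default.

```rust
// Hedged sketch: when building the deletion batch, a non-null value column is
// filled with its default instead of null so the schema check passes.
#[derive(Clone, Debug, PartialEq)]
enum Value {
    Null,
    Int(i64),
}

struct ColumnSchema {
    name: &'static str,
    nullable: bool,
    default: Value,
}

fn fill_for_delete(schema: &[ColumnSchema], rows: usize) -> Vec<(String, Vec<Value>)> {
    schema
        .iter()
        .map(|col| {
            // Nullable columns can be padded with null; non-null columns
            // must fall back to their default value.
            let filler = if col.nullable { Value::Null } else { col.default.clone() };
            (col.name.to_string(), vec![filler; rows])
        })
        .collect()
}

fn main() {
    let schema = [
        ColumnSchema { name: "nullable_v", nullable: true, default: Value::Null },
        ColumnSchema { name: "non_null_v", nullable: false, default: Value::Int(0) },
    ];
    let batch = fill_for_delete(&schema, 2);
    assert_eq!(batch[1].1, vec![Value::Int(0), Value::Int(0)]);
    println!("{batch:?}");
}
```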
* feat: add insert test cases
* fix: update results after rebase develop
* feat: supports unsigned integer types and big_insert test
* test: add insert_invalid test
* feat: supports time index constraint for bigint type
* chore: time index column at last
* test: adds more order, limit test
* fix: style
* feat: adds numbers table in standalone memory catalog mode
* feat: enable fail_fast and test_filter in sqlness
* feat: add more tests
* fix: test_filter
* test: add alter tests
* feat: supports if_not_exists when create database
* test: filter_push_down and catalog test
* fix: compile error
* fix: delete output file
* chore: ignore integration test output in git
* test: update all integration test results
* fix: by code review
* chore: revert .gitignore
* feat: sort the show tables/databases results
* chore: remove issue link
* fix: compile error and code format after rebase
* test: update all integration test results
* chore: minor change on election
* chore: refactor some from/into
* feat: add in_memory store for leader node
* refactor: make context mutable
* feat: add ResetableKvStore trait
* feat: dn supports reporting the number of regions to meta
* put the heartbeat batch to store
* cr: change region_number's parameter to &CatalogManagerRef
* cr: when dn fails to get the region number, report region_num = -1 to meta
* feat: add catalog name resolution for postgres and http interface
* test: add tests for catalog resolution on http and postgres
* feat: assign custom catalog for query
* chore: order code for better readability
* refactor: make influxdb, opentsdb and prometheus read/write go through the gRPC interface, to unify and simplify the Frontend instance in either standalone or distributed mode
* docs: Fix incorrect comment of Vector::only_null
* feat: Add delete to WriteRequest and WriteBatch
* feat: Filter deleted rows
* fix: Fix panic after reopening engine
This is detected by adding a reopen step to the delete test for region.
* fix: Fix OpType::min_type()
* test: Add delete absent key test
* chore: Address CR comments
* fix: carry non-recordbatch results in FlightData, to allow executing SQLs other than selections in the new gRPC interface
* Update src/datanode/src/instance/flight/stream.rs
Co-authored-by: Jiachun Feng <jiachun_feng@proton.me>
* chore: minor openup
* chore: open up auth_mysql and return ()
* chore: typo change
* chore: change according to ci
* chore: change according to ci
* chore: remove tonic status in auth error
* chore: Remove unused MutationExtra
* refactor(storage): Refactor Mutation and Payload
Change Mutation from an enum to a struct that holds the op type and record
batches, so the encoder doesn't need to convert the mutation into a record
batch. The Payload is no longer an enum; it just holds the data of the
WriteBatch to be serialized to the WAL. The encoder and decoder now deal
with the Payload instead of the WriteBatch, so we can keep information in
the WriteBatch that doesn't need to be stored in the WAL.
This commit also merges variants of write_batch::Error into storage::Error,
as some of their variants denote the same error.
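A rough sketch of the resulting shapes (field names are assumptions, not the actual storage types):

```rust
// Rough sketch of the refactor described above; field names are assumptions.
#[derive(Clone, Copy, Debug)]
enum OpType {
    Put,
    Delete,
}

// A stand-in for the columnar record batch carried by each mutation.
#[derive(Debug)]
struct RecordBatch {
    rows: usize,
}

// Mutation is a plain struct (op type + data), no longer an enum, so the
// WAL encoder can serialize it without converting variants first.
#[derive(Debug)]
struct Mutation {
    op_type: OpType,
    record_batch: RecordBatch,
}

// Payload holds only what must go to the WAL; the WriteBatch itself can keep
// extra bookkeeping that never needs to be persisted.
#[derive(Debug)]
struct Payload {
    mutations: Vec<Mutation>,
}

fn main() {
    let payload = Payload {
        mutations: vec![
            Mutation { op_type: OpType::Put, record_batch: RecordBatch { rows: 8 } },
            Mutation { op_type: OpType::Delete, record_batch: RecordBatch { rows: 2 } },
        ],
    };
    println!("{payload:?}");
}
```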
* test(storage): Pass all tests in storage
* chore: Remove unused codes then format codes
* test(storage): Fix test_put_unknown_column test
* style(storage): Fix clippy
* chore: Remove some unused codes
* chore: Rebase upstream and fix clippy
* chore(storage): Remove unused codes
* chore(storage): Update comments
* feat: Remove PayloadType from wal.proto
* chore: Address CR comments
* chore: Remove unused write_batch.proto
* chore: upgrade to Arrow 29.0 and use workspace package and dependencies
* fix: resolve PR comments
Co-authored-by: luofucong <luofucong@greptime.com>
* feat: check database existence on http api
* Update src/servers/src/http/handler.rs
Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
* feat: use database not found status code
* test: add assertion for status code
Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
* feat: add MemUserProvider and impl auth
* feat: impl user_provider option in fe and standalone mode
* chore: add file impl for mem provider
* chore: remove mem opts
* chore: minor change
* chore: refactor pg server to use user_provider as an indicator for using password auth
* chore: fix test
* chore: extract common code
* chore: add unit test
* chore: rebase develop
* chore: add user provider to http server
* chore: minor rename
* chore: change to ref when convert to anymap
* chore: fix according to clippy
* chore: remove clone on startcommand
* chore: fix cr issue
* chore: update tempdir use
* chore: change TryFrom to normal func while parsing anymap
* chore: minor change
* chore: remove to_lowercase
* feat: add Makefile to aggregate the commands that developers always use
* refactor: add 'clean' and 'unit-test' target
* refactor: add sqlness-test target and modify some description formats
Signed-off-by: zyy17 <zyylsxm@gmail.com>
* feat: use Substrait logical plan to query data from Datanode in Frontend in distributed mode
* fix: resolve PR comments
* fix: resolve PR comments
* fix: resolve PR comments
Co-authored-by: luofucong <luofucong@greptime.com>
* refactor: update RsPy
* depend: add `rustpython-pylib`
* feat: add_frozen stdlib for every vm init
* feat: limit stdlib to a selected few
* chore: use `rev` instead of `branch` in dependency
* refactor: rename to allow_list
* feat: use opt level one
* doc: add username for TODO & change opt level to 0
* style: fmt .toml
* feat: adds s3 object storage configuration
* feat: adds s3 integration test
* chore: use map
* fix: forgot license header
* fix: checking if bucket is empty in test_on
* chore: address CR issues
* refactor: run s3 test with dotenv
* chore: randomize grpc port for test
* fix: README in tests-integration
* chore: remove redundant comments
* feat: mysql and pg server support tls
* chore: replace opensrv-mysql to original
* chore: TlsOption is required but supplies a default value
* feat: mysql server support force tls
* chore: move TlsOption to servers
* test: mysql server disable / prefer / required tls mode
* test: pg server disable / prefer / required tls mode
* chore: add doc and remove no used code
* chore: add TODO and restore cargo linker config
* chore: fix SequenceNotMonotonic error message
previous sequence should be greater than or equal to the given sequence
* Apply suggestions from code review
Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
* deps: Bump OpenDAL to v0.21.1
Signed-off-by: Xuanwo <github@xuanwo.io>
* Avoid using raw types when not needed
Signed-off-by: Xuanwo <github@xuanwo.io>
* refactor: options and sample configurations
* chore: newline at end of file
* chore: format code
* chore: remove comment and set sample configurations to default values
* chore: use single quoted string in sample configuration files
* doc: fix spelling, minor grammar mistakes
also provided alternatives for "with transparent experience from users' perspective"
alternatives:
1. provide users with transparency
2. provide a transparent experience for all users
3. transparent to users from all perspectives
* docs: apply suggestions from code review
Co-authored-by: xiaomin tang <xtang@users.noreply.github.com>
* fix: table conflicts in different databases, #483
* feat: support db query param in prometheus remoting read/write
* feat: support db query param in influxdb line protocol
* fix: make schema_name work in gRPC
* fix: table data path
* fix: table manifest dir
* feat: adds opendal logging layer to object store
* Update src/frontend/src/instance.rs
Co-authored-by: LFC <bayinamine@gmail.com>
* Update src/frontend/src/instance.rs
Co-authored-by: LFC <bayinamine@gmail.com>
* Update src/servers/src/line_writer.rs
Co-authored-by: Lei, Huang <6406592+v0y4g3r@users.noreply.github.com>
* Update src/servers/src/line_writer.rs
Co-authored-by: Lei, Huang <6406592+v0y4g3r@users.noreply.github.com>
* fix: compile error
* ci: use larger runner for running coverage
* fix: address already in use in test
Co-authored-by: LFC <bayinamine@gmail.com>
Co-authored-by: Lei, Huang <6406592+v0y4g3r@users.noreply.github.com>
* feat: make sql api output a vector to support multi-statement
* feat: add execution_time_ms to http sql and script api
* fix: use u128 for execution time
* Apply suggestions from code review
Co-authored-by: Yingwen <realevenyag@gmail.com>
* fix: lint error
Co-authored-by: Yingwen <realevenyag@gmail.com>
* fix: dedup should not mark elements as unneeded
It should only mark elements as selected, because some columns of
different rows may have the same value.
* refactor: Rename dedup to find_unique
The original `dedup` method only marks the bitmap as true when it finds
a unique element, so `find_unique` is a more appropriate name.
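An illustrative version of the semantics, assuming a sorted key column and a plain boolean selection bitmap; the real vector types are omitted:

```rust
// Illustrative find_unique over a sorted key column: it only SETS bits for
// rows whose key differs from the previous row; it never clears a bit that an
// earlier pass (e.g. on another column) already selected.
fn find_unique(sorted_keys: &[i64], selected: &mut [bool]) {
    assert_eq!(sorted_keys.len(), selected.len());
    for i in 0..sorted_keys.len() {
        if i == 0 || sorted_keys[i] != sorted_keys[i - 1] {
            selected[i] = true; // mark unique; existing `true` bits stay untouched
        }
    }
}

fn main() {
    let keys = [1, 1, 2, 3, 3];
    // Suppose row 1 was already selected by a previous column's pass.
    let mut selected = [false, true, false, false, false];
    find_unique(&keys, &mut selected);
    // Row 1 stays selected even though its key duplicates row 0.
    assert_eq!(selected, [true, true, true, true, false]);
}
```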
* test: Renew bitmap in test_batch_find_unique
* chore: Update comments
* fix: Also handles admin request in another runtime
* chore: Describe why executes admin request in another runtime
* test: Enable test_insert_and_select
* refactor: reverse the dependency, from frontend depending on datanode to datanode depending on frontend
* wip: start frontend in datanode
* wip: migrate create database to frontend
* wip: impl alter table
* fix: CR comments
* feat: add table id and region ids field to CreateExpr
* chore: rebase develop
* refactor: frontend catalog should set from datanode
* feat: gRPC AddColumn request support add multi columns
* wip: move create table and create-on-insertion to frontend
* wip: error handling
* fix: some unit tests
* fix: all unit tests
* chore: merge develop
* feat: add create/alter-on-insertion to dist_insert/sql_dist_insert
* fix: add region number/catalog/schema to InsertExpr
* feat: add handle_create_table/handle_create_database...
* fix: remove catalog from insert expr
* fix: CR comments
* fix: when running in standalone mode, mysql opts and postgres opts should be passed to the frontend so that the actually running service can change the port to listen on
* refactor: add a standalone subcommand, move frontend start stuff to cmd package
* chore: optimize create table failure logs
* docs: change readme
* docs: update readme
* refactor: reverse the dependency, from frontend depending on datanode to datanode depending on frontend
* wip: start frontend in datanode
* wip: migrate create database to frontend
* wip: impl alter table
* fix: CR comments
* feat: add server_version as postgresql jdbc connector requires
* refactor: do not require password at the moment
* fix: correct datetime output as required by postgresql
* docs: corrected timestamp on our readme
* refactor: simplify import
* fix: address review issues
* feat: move time index metadata from schema into field
* chore: remove useless code
* test: test select with column alias
* fix: conflicts with develop branch
* test: add test
* test: order by timestamp to ensure query results order
* fix: comment
* ci: Upgrade rust-cache to v2.2.0
v2.0.0 uses an API that is deprecated
* ci: Use --workspace in cargo llvm-cov
* ci: Replace actions-rs/toolchain by dtolnay/rust-toolchain
actions-rs/toolchain is no longer actively maintained; it uses Node 12, which
will soon become deprecated
* ci: Replace actions-rs/cargo by run
* ci: rust-cache and cleanup-disk-action try not to specify the full version
* ci: Use nextest
Also sets timeout for nextest to avoid a test hanging too long
* ci: Upgrade actions/checkout to v3
To upgrade node from 12 to 16
* ci: Specify cleanup-disk-action version
* feat: supports list array in arrow_array_get
* feat: supports string and list type conversions in python coprocessor
* test: add test cases for returning list in coprocessor
* fix: Fix int64 type not considered in DEFAULT CURRENT_TIMESTAMP() constraint
Also avoid using `ConstantVector` in the default constraint, as other users
may try to downcast it to a concrete type and sometimes may forget to
check whether it is a constant vector.
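A minimal sketch of the alternative, with a stand-in vector type: materialize the default into an ordinary vector so downstream downcasts to concrete types keep working.

```rust
// Sketch of the idea: materialize the default value into an ordinary vector
// rather than a constant-wrapper vector, so callers that downcast to a
// concrete vector type keep working. `Int64Vector` here is a stand-in type.
#[derive(Debug, PartialEq)]
struct Int64Vector(Vec<i64>);

fn default_timestamp_column(default_millis: i64, rows: usize) -> Int64Vector {
    // Repeat the default value for every row instead of wrapping it in a
    // constant vector that other code might fail to recognize.
    Int64Vector(vec![default_millis; rows])
}

fn main() {
    let col = default_timestamp_column(1_680_000_000_000, 3);
    assert_eq!(col, Int64Vector(vec![1_680_000_000_000; 3]));
}
```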
* test: Add test for writing default value
* add line_writer and convert insert_stmt to InsertRequest
* support convert influxdb line protocol to InsertRequest
* support convert opentsdb to InsertRequest
* cr
* feat: Support removing columns from mito table
Implements drop column for the mito table engine, and adjusts the execution
order of altering a table: persist the table manifest first, then alter
the schema of the region.
* feat(storage): Remove duplicate table_info() impl
Table already provides table_info() now, so some downcasts in tests are
also no longer needed.
* test: Add tests for add/remove columns
* style(table): Fix clippy
* fix: Find timestamp index by its column name
The previous implementation updated the timestamp index too early, which
caused the check that compares the index of the column to remove with the
timestamp index to fail.
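A small sketch of the fix's idea, with plain slices standing in for the table schema: locate the timestamp index by column name after the removal, instead of adjusting a positional index up front.

```rust
// Sketch: after dropping columns, locate the timestamp index by the column's
// name instead of carrying an adjusted positional index through the checks.
fn timestamp_index(columns: &[&str], ts_name: &str) -> Option<usize> {
    columns.iter().position(|c| *c == ts_name)
}

fn main() {
    let before = ["host", "cpu", "memory", "ts"];
    // The index check against the column being removed still sees the original
    // layout; only afterwards do we recompute where "ts" lives.
    let after: Vec<&str> = before.iter().copied().filter(|c| *c != "memory").collect();
    assert_eq!(timestamp_index(&before, "ts"), Some(3));
    assert_eq!(timestamp_index(&after, "ts"), Some(2));
}
```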
* chore: Remove generated comment in Cargo.toml
* chore: Rename alter to builder_with_alter_kind
* refactor: Alloc new column from TableMeta
* style: Fix clippy
* chore: refactor dir for local catalog manager
* refactor: CatalogProvider returns Result
* refactor: SchemaProvider returns Result
* feat: add kv operations to remote catalog
* chore: refactor some code
* feat: impl catalog initialization
* feat: add register table and register system table function
* refactor: add table_info method for Table trait
* chore: add some tests
* chore: add register schema test
* chore: fix build issue after rebase onto develop
* refactor: mock to separate file
* build: failed to compile
* fix: use a container struct to bridge KvBackend and Accessor trait
* feat: upgrade opendal to 0.17
* test: add more tests
* chore: add catalog name and schema name to table info
* chore: add catalog name and schema name to table info
* chore: rebase onto develop
* refactor: common-catalog crate
* refactor: remove remote catalog related files
* fix: compilation
* feat: add table version to TableKey
* feat: add node id to TableValue
* fix: some CR comments
* chore: change async fn create_expr_to_request to sync
* fix: add backtrace to errors
* fix: code style
* refactor: merge refactor/catalog-crate
* feat: table key with version
* feat: impl KvBackend for MetaClient
* fix: integrate metaclient
* fix: catalog use local table info as baseline
* fix: sync metasrv
* fix: wip
* fix: update remote catalog on register and deregister
* refactor: CatalogProvider
* refactor: CatalogManager
* fix: catalog key filtering
* fix: pass some test
* refactor: catalog iterating
* fix: CatalogManager::table also requires both catalog_name and schema_name
* chore: merge develop
* chore: merge catalog crate
* fix: adapt to recent meta-client api change
* feat: datanode lease
* feat: remote catalog (#356)
* feat: datanode heartbeat (#355)
* feat: add heartbeat task to instance
* feat: add node_id datanode opts
* fix: use real node id in heartbeat and meta client
* feat: distribute table in frontend
* test: distribute read demo
* test: distribute read demo
* test: distribute read demo
* add write splitter
* fix: node id changed to u64
* feat: datanode uses remote catalog implementation
* dist insert integrate table
* feat: specify region ids on creating table (#359)
* fix: compiling issues
* feat: datanode lease (#354)
* Some glue code about dist_insert
* fix: correctly wrap string value with quotes
* feat: create route
* feat: frontend catalog (#362)
* feat: integrate catalog to frontend
* feat: preserve partition rule on create
* fix: print tables on start
* chore: log in create route
* test: distribute read demo
* feat: support metasrv addr command line options
* feat: optimize DataNodeInstance creation (#368)
* chore: remove unnecessary changes
* chore: revert changes to src/api
* chore: revert changes to src/datanode/src/server.rs
* chore: remove opendal backend
* chore: optimize imports
* chore: revert changes to instance and region ids
* refactor: MetaKvBackend range
* fix: remove some wrap
* refactor: initialization of catalog
* feat: add region id to create table request and add heartbeat task to datanode instance
* fix: fix auto reconnect for heartbeat task
* chore: change TableValue::region_numbers to Vec<u32>.
* fix: some tests
* fix: avoid concurrently starting the heartbeat task by using compare_exchange
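An illustration of the compare_exchange guard named above (struct and field names are assumed): only the caller that flips the flag from false to true actually starts the task.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Illustration of the compare_exchange guard mentioned above: only the caller
// that flips `started` from false to true actually spawns the heartbeat task.
struct HeartbeatTask {
    started: AtomicBool,
}

impl HeartbeatTask {
    fn start(&self) -> bool {
        self.started
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }
}

fn main() {
    let task = HeartbeatTask { started: AtomicBool::new(false) };
    assert!(task.start());   // first caller wins and starts the task
    assert!(!task.start());  // concurrent/second callers are no-ops
}
```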
* feat: refactor catalog key and values, separate table info into two kinds of keys
* feat: bump table id from metasrv
* fix: compare and set table id
* chore: merge develop
* fix: use integer serialization instead of string serialization
Co-authored-by: jiachun <jiachun_fjc@163.com>
Co-authored-by: luofucong <luofucong@greptime.com>
Co-authored-by: fys <1113014250@qq.com>
Co-authored-by: Jiachun Feng <jiachun_feng@proton.me>
* fix: some cr comments
* fix: fix unit tests
* ci: use lld linker for ci
* ci: do a disk cleanup before test
* ci: add llvm cache to speedup installation
* ci: use lld linker for coverage as well
* feat: use lld for release too
* feat: port own UDF & UDAF into py copr (untested yet)
* refactor: move UDF&UDAF to greptime_builtins
* feat: support List in val2py_obj
* test: some testcases for newly added UDFs
* test: complete test for all added gpdb's own UDF
* refactor: add underscore for long func name
* feat: better error message
* fix: typo
* chore: refactor dir for local catalog manager
* refactor: CatalogProvider returns Result
* refactor: SchemaProvider returns Result
* feat: add kv operations to remote catalog
* chore: refactor some code
* feat: impl catalog initialization
* feat: add register table and register system table function
* refactor: add table_info method for Table trait
* chore: add some tests
* chore: add register schema test
* chore: fix build issue after rebase onto develop
* refactor: mock to separate file
* build: failed to compile
* fix: use a container struct to bridge KvBackend and Accessor trait
* feat: upgrade opendal to 0.17
* test: add more tests
* chore: add catalog name and schema name to table info
* chore: add catalog name and schema name to table info
* chore: rebase onto develop
* refactor: common-catalog crate
* refactor: remove remote catalog related files
* fix: compilation
* feat: add table version to TableKey
* feat: add node id to TableValue
* fix: some CR comments
* chore: change async fn create_expr_to_request to sync
* fix: add backtrace to errors
* fix: code style
* refactor: merge refactor/catalog-crate
* feat: table key with version
* feat: impl KvBackend for MetaClient
* fix: integrate metaclient
* fix: catalog use local table info as baseline
* fix: sync metasrv
* fix: wip
* fix: update remote catalog on register and deregister
* refactor: CatalogProvider
* refactor: CatalogManager
* fix: catalog key filtering
* fix: pass some test
* refactor: catalog iterating
* fix: CatalogManager::table also requires both catalog_name and schema_name
* chore: merge develop
* chore: merge catalog crate
* fix: adapt to recent meta-client api change
* feat: datanode lease
* feat: remote catalog (#356)
* chore: refactor dir for local catalog manager
* refactor: CatalogProvider returns Result
* refactor: SchemaProvider returns Result
* feat: add kv operations to remote catalog
* chore: refactor some code
* feat: impl catalog initialization
* feat: add register table and register system table function
* refactor: add table_info method for Table trait
* chore: add some tests
* chore: add register schema test
* chore: fix build issue after rebase onto develop
* refactor: mock to separate file
* build: failed to compile
* fix: use a container struct to bridge KvBackend and Accessor trait
* feat: upgrade opendal to 0.17
* test: add more tests
* chore: add catalog name and schema name to table info
* chore: add catalog name and schema name to table info
* chore: rebase onto develop
* refactor: common-catalog crate
* refactor: remove remote catalog related files
* fix: compilation
* feat: add table version to TableKey
* feat: add node id to TableValue
* fix: some CR comments
* chore: change async fn create_expr_to_request to sync
* fix: add backtrace to errors
* fix: code style
* refactor: merge refactor/catalog-crate
* feat: table key with version
* feat: impl KvBackend for MetaClient
* fix: integrate metaclient
* fix: catalog use local table info as baseline
* fix: sync metasrv
* fix: wip
* fix: update remote catalog on register and deregister
* refactor: CatalogProvider
* refactor: CatalogManager
* fix: catalog key filtering
* fix: pass some test
* refactor: catalog iterating
* fix: CatalogManager::table also requires both catalog_name and schema_name
* chore: merge develop
* chore: merge catalog crate
* fix: adapt to recent meta-client api change
* feat: datanode heartbeat (#355)
* feat: add heartbeat task to instance
* feat: add node_id datanode opts
* fix: use real node id in heartbeat and meta client
* feat: distribute table in frontend
* test: distribute read demo
* test: distribute read demo
* test: distribute read demo
* add write splitter
* fix: node id changed to u64
* feat: datanode uses remote catalog implementation
* dist insert integrate table
* feat: specify region ids on creating table (#359)
* fix: compiling issues
* feat: datanode lease (#354)
* Some glue code about dist_insert
* fix: correctly wrap string value with quotes
* feat: create route
* feat: frontend catalog (#362)
* feat: integrate catalog to frontend
* feat: preserve partition rule on create
* fix: print tables on start
* chore: log in create route
* test: distribute read demo
* feat: support metasrv addr command line options
* feat: optimize DataNodeInstance creation (#368)
* chore: remove unnecessary changes
* chore: revert changes to src/api
* chore: revert changes to src/datanode/src/server.rs
* chore: remove opendal backend
* chore: optimize imports
* chore: revert changes to instance and region ids
* refactor: MetaKvBackend range
* fix: remove some wrap
* refactor: initiation of catalog
* fix: next range request start key
* fix: mock delete range
* refactor: simplify range response handling
Co-authored-by: jiachun <jiachun_fjc@163.com>
Co-authored-by: luofucong <luofucong@greptime.com>
Co-authored-by: fys <1113014250@qq.com>
Co-authored-by: Jiachun Feng <jiachun_feng@proton.me>
* refactor: Serialize Schema/TableMeta/TableInfo to raw structs
* test: Add tests for raw struct conversion
* style: Fix clippy
* refactor: SchemaBuilder::timestamp_index takes Option<usize>
So callers can chain the timestamp_index method call even when there is no
timestamp index.
* style(datatypes): Chains SchemaBuilder method calls
* chore: meta mock
* chore: refactor datanode selector
* chore: create route mock test
* chore: add mock module
* chore: memory store for test
* chore: mock meta for test
* chore: ensure memory store has the same behaviour as etcd
* chore: replace tokio lock with parking_lot
* fix: Fix TestGuard being dropped before grpc test starts
* feat: Let start and shutdown takes immutable reference to self
Also implement shutdown for GrpcServer
* feat: Implement shutdown for HttpServer
* style: Fix clippy
* chore: Add name to AlreadyStarted error
This patch changes the output of our HTTP SQL API and prepares it for our SQL editor development. Changes include:
- includes aide for OAS 3.1 OpenAPI documents, available at /v1/private/api.json
- simplified some of the HTTP handlers' return types to use string or JSON directly
- created a new HttpRecordsOutput type to hide the internals of RecordBatch from the end user. It also tunes the data structure to be friendly for applications to consume
- updated the response struct to use a code for success or a detailed error code
Residual issue #366
* feat: allow http post for our sql http api
* feat: update our http api and attempt to add openapi spec support
* test: correct test against new handler apis
* refactor: rename rows to records
* refactor: removed HttpResponse completely
* feat: add information to our openapi docs
* feat: add docs for sql interface response
* refactor: use struct to represent query so we can doc it via aide
* refactor: use arc wrapped api
* feat: add redoc UI support
* Update src/servers/src/http.rs
Co-authored-by: LFC <bayinamine@gmail.com>
* Update src/servers/src/http.rs
Co-authored-by: LFC <bayinamine@gmail.com>
* fix: address review comments
* test: update integration tests for new api output
* refactor: make prometheus http apis compatible with recent changes
* refactor: get schema from stream
* test: add test for recordbatch to json serialization
* test: add todo for a test to be fixed later
* Revert "test: add todo for a test to be fixed later"
This reverts commit a5a50c7afb.
* fix: Revert "refactor: get schema from stream"
This reverts commit 945b685556.
* chore: add todo for pending issue #366
* chore: remove fixed server url in openapi docs
* feat: include error_code in json response
* refactor: use code over success field in json response
Co-authored-by: LFC <bayinamine@gmail.com>
* feat(storage): Implement skeleton of ReadResolver
ReadResolver is used to resolve differences between schemas
* feat(storage): Add user_column_end to ReadResover
* feat(storage): Implement Batch::batch_from_parts
Used to construct Batch from parts according to the schema that user
expects to read.
* feat(storage): Compat memtable schema
* feat(storage): Compat parquet file schema
* fix(storage): ReadResolver supports projection under same schema version
Now ReadResolver takes ProjectedSchemaRef as dest schema, and checks
whether a value column is needed by the schema after projection.
* feat(storage): Check whether columns are same columns
is_source_column_readable() takes ColumnMetadata instead of
ColumnSchema, and compares their column ids to check whether they are
the same columns.
* refactor(storage): Use row_key_end/user_column_end in source_schema
Rename ReadResolver::is_needed to ReadResolver::is_source_needed, and
remove row_key_end/user_column_end from ReadResolver, since they should
be the same as source_schema's
* chore(storage): Remove unused codes
* test(storage): Add tests for the resolver
* feat(storage): Returns error on different source and dest column names
* style(storage): Fix clippy
* refactor: Rename ReadResolver to ReadAdapter
* chore(table): Removed unused comment
* refactor: rename to is_source_column_compatible
* feat: scaffold for prometheus protocol handler
* feat: impl remote write and read for prometheus
* chore: make label matchers working in remote reading
* chore: case sensitive regexp matching for labels and tweak restful api
* test: prometheus test
* test: adds test for prometheus handler and http server
* fix: typo in comment
* refactor: move snappy_compress and snappy_decompress
* fix: by code review
* fix: collect_timeseries_ids
* fix: timestamp and value column's value may be null
* feat: align_bucket support i64 and timestamp values
* feat: add Int64 to timestamp
* feat: support query i64 timestamp vector
* test: fix failing tests
* refactor: simplify some code
* fix: CR comments and add insert and query test for i64 timestamp column
* chore: refactor dir for local catalog manager
* refactor: CatalogProvider returns Result
* refactor: SchemaProvider returns Result
* feat: add kv operations to remote catalog
* chore: refactor some code
* feat: impl catalog initialization
* feat: add register table and register system table function
* refactor: add table_info method for Table trait
* chore: add some tests
* chore: add register schema test
* chore: fix build issue after rebase onto develop
* refactor: mock to separate file
* build: failed to compile
* fix: use a container struct to bridge KvBackend and Accessor trait
* feat: upgrade opendal to 0.17
* test: add more tests
* chore: add catalog name and schema name to table info
* chore: add catalog name and schema name to table info
* chore: rebase onto develop
* refactor: common-catalog crate
* refactor: remove remote catalog related files
* fix: compilation
* feat: add table version to TableKey
* feat: add node id to TableValue
* fix: some CR comments
* chore: change async fn create_expr_to_request to sync
* fix: add backtrace to errors
* fix: code style
* fix: CatalogManager::table also requires both catalog_name and schema_name
* chore: merge develop
* refactor: return PhysicalPlan in Table trait's scan method, to support partitioned execution in Frontend's distribute read
* refactor: pub use necessary DataFusion types
* refactor: replace old "PhysicalPlan" and its adapters
Co-authored-by: luofucong <luofucong@greptime.com>
Co-authored-by: Yingwen <realevenyag@gmail.com>
* chore: Update StoreSchema comment
* feat: Add metadata to ColumnSchema
* feat: Impl conversion between ColumnMetadata and ColumnSchema
We could use this feature to store the ColumnMetadata as arrow's
Schema, since the ColumnSchema could be further converted to an arrow
schema. Then we could use ColumnMetadata in StoreSchema, which contains
more information, especially the column id.
* feat(storage): Merge schema::Error to metadata::Error
To avoid cyclic dependency of two Errors
* feat(storage): Store ColumnMetadata in StoreSchema
* feat(storage): Use StoreSchemaRef to avoid cloning the whole StoreSchema struct
* test(storage): Fix test_store_schema
* feat(datatypes): Return error on duplicate meta key
* chore: Address CR comments
* refactor: Remove column_null_mask in MutationExtra
MutationExtra::column_null_mask is no longer needed as we could ensure
there is no missing column in WriteBatch.
* feat(storage): Remove MutationExtra
Just stores MutationType in the WalHeader, no longer needs MutationExtra
* refactor: add table_info method for Table trait
* feat: add table_info method to Table trait
* test: add more unit test
* fix: impl table_info for SystemTable
* test: fix failing test
* meta: meta api&client
* meta: heartbeat server init
* feat: kv store
* chore: grpc server
* chore: meta server bootstrap
* feat: heartbeat client
* feat: route for create table
* chore: a channel pool manager
* feat: route client
* feat: store client
* chore: meta_client example
* chore: change schema
* chore: unit test & by cr
* chore: refactor meta client
* chore: add unit test
* feat: Adds ColumnDefaultConstraint::create_default_vector
ColumnDefaultConstraint::create_default_vector is ported from
MitoTable::try_get_column_default_constraint_vector.
* refactor: Replace try_get_column_default_constraint_vector by create_default_vector
* style: Remove unnecessary map_err in MitoTable::insert
* feat: Adds compat_write
For columns in `dest_schema` but not in `write_batch`, this method would insert a
vector with the default value into the `write_batch`. If there are columns not in
`dest_schema`, an error would be returned.
* chore: Add info log to RegionInner::alter
* feat(storage): RegionImpl::write support request with old version
* feat: Add nullable check when creating default value
* feat: Validate nullable and default value
* chore: Modify PutOperation comments
* chore: Make ColumnDescriptor::is_nullable readonly and validate name
* feat: Use CompatWrite trait to replace campat::compat_write method
Adds a CompatWrite trait to support padding columns to WriteBatch:
- The WriteBatch and PutData implement this trait
- Fix the issue that WriteBatch::schema is not updated to the
schema after compat
- Also validate the created column when adding to PutData
The WriteBatch also pads default values for missing columns in
PutData, so the memtable inserter doesn't need to manually check whether
the column is nullable and then insert a NullVector. Every WriteBatch is
ensured to have all columns defined by the schema in its PutData.
* feat: Validate constraint by ColumnDefaultConstraint::validate()
The ColumnDefaultConstraint::validate() would also ensure the default
value has the same data type as the column's.
* feat: Use NullVector for null columns
* fix: Fix BinaryType returns wrong logical_type_id
* fix: Fix tests and revert NullVector for null columns
NullVector doesn't support a custom logical type, which makes it hard to
encode/decode and also causes the arrow/protobuf codec of the write batch
to fail.
* fix: create_default_vector use replicate to create vector with default value
This would fix the test_codec_with_none_column_protobuf test, as we need
to downcast the vector to construct the protobuf values.
* test: add tests for column default constraints
* test: Add tests for CompatWrite trait impl
* test: Test write region with old schema
* fix(storage): Fix replay() applies metadata too early
The committed sequence of the RegionChange action is the sequence of the
last entry that uses the old metadata (schema). During replay, we should
apply the new metadata after we see an entry that has a sequence greater
than (not equal to) the `RegionChange::committed_sequence`.
Also removes the duplicate `set_committed_sequence()` call in
persist_manifest_version()
* chore: Removes some unreachable code
Also adds more comments to document the code in these files
* refactor: Refactor MitoTable::insert
Return error if we could not create a default vector for given column,
instead of ignoring the error
* chore: Fix incorrect comments
* chore: Fix typo in error message
* refactor:replace another axum-test-helper branch
* refactor: upgrade opendal version
* refactor: use cursor for file buffer
* refactor:remove native-tls in mysql_async
* refactor: use async block and pipeline for newer opendal api
* chore: update Cargo.lock
* chore: update dependencies
* docs: removed openssl from build requirement
* fix: call close on pipe writer to flush reader for parquet streamer
* refactor: remove redundant return
* chore: use pinned revision for our forked mysql_async
* style: avoid wild-card import in test code
* Apply suggestions from code review
Co-authored-by: Yingwen <realevenyag@gmail.com>
* style: use chained call for builder
Co-authored-by: liangxingjian <965662709@qq.com>
Co-authored-by: Yingwen <realevenyag@gmail.com>
* test(servers): OpenTSDB shutdown test cover error branch
Create connection continuously to cover some branches of error handling
in OpentsdbServer
* test(servers): Add more tests for opentsdb server
Add a test to ensure we could not connect the server after shutdown and
a test to check existing connection usage after shutdown
* feat: adds committed_sequence to RegionChange action, #281
* refactor: saving protocol action when writer version is changed
* feat: recover all region metadata in manifest and replay them when replaying WAL, #282
* refactor: minor change and test recovering metadata after altering table schema
* fix: write wrong min_reader_version into manifest for region
* refactor: move up DataRow
* refactor: by CR comments
* test: assert recovered metadata
* refactor: by CR comments
* fix: comment
* fix(storage): Failure of writing manifest version won't abort applying edit
* feat(storage): Adds RegionMetadata::validate_alter to validate AlterRequest
* fix(storage): Protect write and apply region edit by version mutex
The region meta action needs the previous manifest version, so we need to
use the version mutex to prevent other threads from updating the manifest
version while writing the action to the manifest.
* feat(storage): Implement RegionWriter::alter
RegionWriter::alter() would
1. acquire write lock first
2. then validate the alter request
3. build the new metadata by RegionMetadata::alter()
4. acquire the version lock
5. write the metadata to the manifest, which also bumps the manifest
version
6. freeze mutable memtables and apply the new metadata to Version
7. write the manifest version to wal
* test(storage): Add tests for Region::alter()
* test(storage): Add tests for RegionMetadata::validate_alter
* chore(storage): Modify InvalidAlterRequest error msg
* chore: Adjust comment
Thanks a lot for considering contributing to GreptimeDB. We believe people like you would make GreptimeDB a great product. We intend to build a community where individuals can have open talks, show respect for one another, and speak with true ❤️. Meanwhile, we aim to keep transparency and make your effort count here.
Read the guidelines; they can help you get started. Communicate with respect to the developers maintaining and developing the project. In return, they should reciprocate that respect by addressing your issues, reviewing changes, and helping finalize and merge your pull requests.
## Pull Requests
Follow our [README](https://github.com/GreptimeTeam/greptimedb#readme) to get the whole picture of the project. To learn about the design of GreptimeDB, please refer to the [design docs](https://github.com/GrepTimeTeam/docs).
## Your First Contribution
It can feel intimidating to contribute to a complex project, but it can also be exciting and fun. These general notes will help everyone participate in this communal activity.
- Follow the [Code of Conduct](https://github.com/GreptimeTeam/greptimedb/blob/develop/CODE_OF_CONDUCT.md)
- Small changes make huge differences. We will happily accept a PR making a single character change if it helps move forward. Don't wait to have everything working.
- Check the closed issues before opening your issue.
- Try to follow the existing style of the code.
- More importantly, when in doubt, ask away.
Pull requests are great, but we accept all kinds of other help as well, such as:
- Write tutorials or blog posts. Blog, speak about, or create tutorials about one of GreptimeDB's many features. Mention [@greptime](https://twitter.com/greptime) on Twitter and email info@greptime.com so we can give pointers and tips and help you spread the word by promoting your content on Greptime communication channels.
- Improve the documentation. [Submit documentation](http://github.com/greptimeTeam/docs/) updates, enhancements, designs, or bug fixes, and fixing any spelling or grammar errors will be very much appreciated.
- Present at meetups and conferences about your GreptimeDB projects. Your unique challenges and successes in building things with GreptimeDB can provide great speaking material. We'd love to review your talk abstract, so get in touch with us if you'd like some help!
- Submit bug reports. To report a bug or a security issue, you can [open a new GitHub issue](https://github.com/GrepTimeTeam/greptimedb/issues/new).
- Speak up about feature requests. Sending feedback is a great way for us to understand your different use cases of GreptimeDB better. If you want to share your experience with GreptimeDB, or if you want to discuss any ideas, you can start a discussion on [GitHub discussions](https://github.com/GreptimeTeam/greptimedb/discussions), chat with the Greptime team on [Slack](https://greptime.com/slack), or tweet [@greptime](https://twitter.com/greptime) on Twitter.
## Code of Conduct
Also, there are things that we are not looking for because they don't match the goals of the product or benefit the community. Please read [Code of Conduct](https://github.com/GreptimeTeam/greptimedb/blob/develop/CODE_OF_CONDUCT.md); we hope everyone can keep good manners and become an honored member.
## License
GreptimeDB uses the [Apache 2.0 license](https://github.com/GreptimeTeam/greptimedb/blob/master/LICENSE) to strike a balance between open contributions and allowing you to use the software however you want.
## Getting Started
### Submitting Issues
- Check if an issue already exists. Before filing an issue report, see whether it's already covered. Use the search bar and check out existing issues.
- File an issue:
- To report a bug, a security issue, or anything that you think is a problem and that isn't under the radar, go ahead and [open a new GitHub issue](https://github.com/GrepTimeTeam/greptimedb/issues/new).
- In the given templates, look for the one that suits you.
- If you bump into anything, reach out to our [Slack](https://greptime.com/slack) for a wider audience and ask for help.
- What happens after:
- Once we spot a new issue, we identify and categorize it as soon as possible.
- Usually, it gets assigned to other developers. Follow up and see what folks are talking about and how they take care of it.
- Please be patient and offer as much information as you can to help reach a solution or a consensus. You are not alone; embrace the power of the team.
### Before PR
- To ensure that the community is free and confident in its ability to use your contributions, please sign the Contributor License Agreement (CLA), which is incorporated into the pull request process.
- Make sure all your code is formatted and follows the [coding style](https://pingcap.github.io/style-guide/rust/).
- Make sure all unit tests pass (using `cargo test --workspace` or [nextest](https://nexte.st/index.html) `cargo nextest run`).
- Make sure all clippy warnings are fixed (you can check it locally by running `cargo clippy --workspace --all-targets -- -D warnings -D clippy::print_stdout -D clippy::print_stderr`).
#### `pre-commit` Hooks
You could setup the [`pre-commit`](https://pre-commit.com/#plugins) hooks to run these checks on every commit automatically.
1. Install `pre-commit`
```
pip install pre-commit
```
or
```
brew install pre-commit
```
2. Install the `pre-commit` hooks
```
$ pre-commit install
pre-commit installed at .git/hooks/pre-commit
$ pre-commit install --hook-type commit-msg
pre-commit installed at .git/hooks/commit-msg
$ pre-commit install --hook-type pre-push
pre-commit installed at .git/hooks/pre-push
```
Now, `pre-commit` will run automatically on `git commit`.
### Title
The titles of pull requests should be prefixed with a category name listed in the [Conventional Commits specification](https://www.conventionalcommits.org/en/v1.0.0),
like `feat`/`fix`/`docs`, with a concise summary of the code change following. DO NOT use the last commit message as the pull request title.
### Description
- Feel free to go brief if your pull request is small, like a typo fix.
- But if it contains large code changes, make sure to state the motivation/design details of this PR so that reviewers can understand what you're trying to do.
- If the PR contains any breaking change or API change, make sure that is clearly listed in your description.
All commit messages SHOULD adhere to the [Conventional Commits specification](https://conventionalcommits.org/).
## Getting Help
There are many ways to get help when you're stuck. It is recommended to ask for help by opening an issue, with a detailed description
of what you were trying to do and what went wrong. You can also reach for help in our [Slack channel](https://greptime.com/slack).
## Community
The core team will be thrilled if you participate in any way you like. When you are stuck, try to ask for help by filing an issue with a detailed description of what you were trying to do and what went wrong. If you have any questions or would like to get involved in our community, please check out:
- [GreptimeDB Community Slack](https://greptime.com/slack)
## What is GreptimeDB
GreptimeDB is an open-source time-series database with a special focus on
scalability, analytical capabilities and efficiency. It's designed to work on
infrastructure of the cloud era, and users benefit from its elasticity and commodity
storage.
Our core developers have been building time-series data platforms
for years. Based on their best practices, GreptimeDB is born to give you:
- A standalone binary that scales to a highly-available distributed cluster, providing a transparent experience for cluster users
- Optimized columnar layout for handling time-series data; compacted, compressed, and stored on various storage backends
- Flexible indexes to tackle high-cardinality issues
To compile GreptimeDB from source, you'll need the following:
- C/C++ Toolchain: provides basic tools for compiling and linking. This is
available as `build-essential` on Ubuntu and under a similar name on other platforms.
- Rust: the easiest way to install Rust is to use
[`rustup`](https://rustup.rs/), which will check our `rust-toolchain` file and
install the correct Rust version for you.
- Protobuf: `protoc` is required for compiling `.proto` files. `protobuf` is
available from major package managers on macOS and Linux distributions. You can
find installation instructions [here](https://grpc.io/docs/protoc-installation/).
**Note that the `protoc` version needs to be >= 3.15** because we use the `optional`
keyword. You can check it with `protoc --version`.
- python3-dev or python3-devel (optional, only needed if you want to run scripts
in CPython; you also need to enable the `pyo3_backend` feature when compiling, by `cargo run -F pyo3_backend` or by adding `pyo3_backend` to `features.default` in src/script/Cargo.toml, like `default = ["python", "pyo3_backend"]`): this installs the Python shared library required for running the Python
scripting engine (in CPython mode). This is available as `python3-dev` on
Ubuntu (you can install it with `sudo apt install python3-dev`) or as
`python3-devel` on RPM-based distributions (e.g. Fedora, Red Hat, SuSE). Mac's
`Python3` package should include this shared library by default. More detail on compiling with PyO3 can be found in [PyO3](https://pyo3.rs/v0.18.1/building_and_distribution#configuring-the-python-version)'s documentation.
#### Build with Docker
A docker image with necessary dependencies is provided:
Start GreptimeDB from source code, in standalone mode:
```
cargo run -- standalone start
```
Or if you built from docker:
```
docker run -p 4002:4002 -v "$(pwd):/tmp/greptimedb" greptime/greptimedb standalone start
```
Please see [the online document site](https://docs.greptime.com/getting-started/overview#install-greptimedb) for more installation options and [operations info](https://docs.greptime.com/user-guide/operations/overview).
### Get started
Read the [complete getting started guide](https://docs.greptime.com/getting-started/overview#connect) on our [official document site](https://docs.greptime.com/).
To write and query data, GreptimeDB is compatible with multiple [protocols and clients](https://docs.greptime.com/user-guide/client/overview).
For Linux and macOS, you can easily download pre-built binaries including official releases and nightly builds that are ready to use.
In most cases, downloading the version without PyO3 is sufficient. However, if you plan to run scripts in CPython (and use Python packages like NumPy and Pandas), you will need to download the version with PyO3 and install a Python with the same version as the Python in the PyO3 version.
We recommend using virtualenv for the installation process to manage multiple Python versions.
Please refer to [contribution guidelines](CONTRIBUTING.md) for more information.
3. Install the git hook scripts:
```
$ pre-commit install
pre-commit installed at .git/hooks/pre-commit
$ pre-commit install --hook-type commit-msg
pre-commit installed at .git/hooks/commit-msg
$ pre-commit install --hook-type pre-push
pre-commit installed at .git/hooks/pre-push
```
Now `pre-commit` will run automatically on `git commit`.
4. Check out branch from `develop` and make your contribution. Follow the [style guide](https://github.com/GreptimeTeam/docs/blob/main/style-guide/zh.md). Create a PR when you are ready, feel free and have fun!
## Acknowledgement
- GreptimeDB uses [Apache Arrow](https://arrow.apache.org/) as the memory model and [Apache Parquet](https://parquet.apache.org/) as the persistent file format.
- GreptimeDB's query engine is powered by [Apache Arrow DataFusion](https://github.com/apache/arrow-datafusion).
- [OpenDAL](https://github.com/datafuselabs/opendal) from [Datafuse Labs](https://github.com/datafuselabs) gives GreptimeDB a very general and elegant data access abstraction layer.
- GreptimeDB’s meta service is based on [etcd](https://etcd.io/).
- GreptimeDB uses [RustPython](https://github.com/RustPython/RustPython) for experimental embedded python scripting.
DataFusion basically executes an aggregate like this:
2. Call `update_batch` on each accumulator with partitioned data, to let you update your aggregate calculation.
3. Call `state` to get each accumulator's internal state, the medial calculation result.
4. Call `merge_batch` to merge all accumulators' internal states into one.
5. Execute `evaluate` on the chosen one to get the final calculation result.
Once you know the meaning of each method, you can easily write your accumulator. You can refer to `Median` accumulator or `SUM` accumulator defined in file `my_sum_udaf_example.rs` for more details.
You can call the `register_aggregate_function` method in the query engine to register your aggregate function. To do that, you have to create an instance of the struct `AggregateFunctionMeta`. The struct has three fields. The first is your aggregate function's name. The function name is case-sensitive due to DataFusion's restriction. We strongly recommend using a lowercase name. If you have to use an uppercase name, wrap your aggregate function with quotation marks. For example, if you define an aggregate function named "my_aggr", you can use "`SELECT MY_AGGR(x)`"; if you define "my_AGGR", you have to use "`SELECT "my_AGGR"(x)`".
The second field is arg_counts, the count of the arguments. Take the accumulator `percentile`, which calculates the p_number of a column: we need to input the value of the column and the value of p to calculate it, so the count of the arguments is two.
The third field is a function that creates the accumulator creator you defined in step 1 above. A creator of creators is a bit intertwined, but it is how we make DataFusion use a newly created aggregate function each time it executes a SQL statement, preventing the stored input types from affecting each other. The key details can be found starting from our `DfContextProviderAdapter` struct's `get_aggregate_meta` method.
A Rust native implementation of PromQL, for GreptimeDB.
# Motivation
Prometheus and its query language PromQL prevail in the cloud-native observability area, which is an important scenario for a time series database like GreptimeDB. We already have support for its remote read and write protocols. Users can now integrate GreptimeDB as the storage backend of an existing Prometheus deployment, but cannot run PromQL queries directly on GreptimeDB like SQL.
This RFC proposes to add support for PromQL. Because it was created in Go, we can't use the existing code easily. For interoperability, performance and extendability, porting its logic to Rust is a good choice.
# Details
## Overview
One of the goals is to make use of our existing basic operators, execution model and runtime to reduce the work. So the entire proposal is built on top of Apache Arrow DataFusion. The rewritten PromQL logic is manifested as `Expr` or `ExecutionPlan` in DataFusion, and both the intermediate data structures and the result are in the format of Arrow's `RecordBatch`.
The following sections are organized in a top-down manner: they start with the evaluation procedure, then introduce the building blocks of the new PromQL operations, follow with an explanation of the data model, and end with an example logical plan.
*This RFC is heavily related to Prometheus and PromQL. It won't repeat some basic concepts of them.*
## Evaluation
The original implementation is like an interpreter of the parsed PromQL AST. It has two characteristics: (1) operations are evaluated in place after they are parsed to the AST, and some key parameters are separated from the AST because they are not present in the query but come from other places, like another field in the HTTP payload; (2) calculation is performed per timestamp, a pattern that appears many times in the original codebase.
These bring out two differences in the proposed implementation. First, to make it more general and clear, the evaluation procedure is reorganized into several phases (the same as DataFusion's). Second, data is evaluated by time series (corresponding to "columnar calculation", if one thinks of the timestamp as the row number).
- Parser
Provided by the [`promql-parser`](https://github.com/GreptimeTeam/promql-parser) crate. Same as the original implementation.
- Logical Planner
Generates a logical plan with all the needed parameters. It should accept something like `EvalStmt` in Go's implementation, which contains query time range, evaluation interval and lookback range.
Another important thing done here is assembling the logical plan, with all the operations baked in logically: what the filter and time range to read are, how the data then flows through a selector into a binary operation, etc., and what the output schema of every single step is. The generated logical plan is deterministic, without variables, and can be `EXPLAIN`ed clearly.
- Physical Planner
This step converts a logical plan into an evaluatable execution plan. There is not much special here compared to the previous step, except when a query is going to be executed in a distributed manner. In this case, a logical plan will be divided into several parts and sent to several nodes; one physical planner only sees its own part.
- Executor
As its name shows, this step calculates the data into results. All the new calculation logic, the implementation of PromQL in Rust, is placed here, and the rewritten functions use `RecordBatch` and `Array` from Arrow as the intermediate data structures.
Each "batch" contains only data from a single time series. This comes from the underlying storage implementation. Though it's not a requirement of this RFC, having this property can simplify some functions.
Another thing to mention is that the rewritten functions are not aware of timestamp or value columns; they are defined only based on the input data types. For example, the `increase()` function in PromQL calculates the unbiased delta of data, and its implementation here only does this single thing. Let's compare the signatures of the two implementations:
Some unimportant parameters are omitted. The original Go version only writes the logic for `Point`'s value, either float or histogram. But the proposed rewritten one accepts a generic `Array` as input, which can be any type that suits, from `i8` to `u64` to `TimestampNanosecond`.
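The actual signatures are elided in this document. As a rough, hedged illustration of the columnar style described above, here is a minimal Rust sketch of an `increase`-like function that only sees a slice of values (standing in for a generic Arrow array) and knows nothing about timestamp or value columns; the extrapolation factor is left out.
```rust
/// Minimal sketch: the non-extrapolated increase over one range of samples.
/// The input is a plain slice standing in for an Arrow array; the function
/// has no notion of timestamp or value columns.
fn increase(values: &[f64]) -> Option<f64> {
    if values.len() < 2 {
        return None;
    }
    let mut result = values[values.len() - 1] - values[0];
    // Handle counter resets: whenever the value drops, the counter restarted,
    // so add back the value observed just before the reset.
    for window in values.windows(2) {
        if window[1] < window[0] {
            result += window[0];
        }
    }
    Some(result)
}
```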
## Plan and Expression
They are structures to express logic from PromQL. The proposed implementation is built on top of DataFusion, thus our plans and expressions are in the form of `ExtensionPlan` and `ScalarUDF`. The only difference between them in this context is the return type: a plan returns a record batch while an expression returns a single column.
This RFC proposes to add four new plans; they are fundamental building blocks that mainly handle the data selection logic in PromQL for the following calculation expressions.
- `SeriesNormalize`
Sorts data inside one series on the timestamp column, and applies the "offset" bias if there is one. This plan usually comes after the `TableScan` (or `TableScan` and `Filter`) plan.
- `VectorManipulator` and `MatrixManipulator`
Corresponding to `InstantSelector` and `RangeSelector`. We don't calculate timestamp by timestamp, thus "vector" is used instead of "instant"; this image shows the difference. And "matrix" is another name for "range vector", to avoid confusion with our "vector". The following section will detail how they are implemented using Arrow.

Due to "interval" parameter in PromQL, data after "selector" (or "manipulator" here) are usually shorter than input. And we have to modify the entire record batch to shorten both timestamp, value and tag columns. So they are formed as plan.
- `PromAggregator`
The carrier of aggregator expressions. This should not be very different from the DataFusion built-in `Aggregate` plan, except PromQL can use "group without" to do reverse selection.
PromQL has around 70 expressions and functions, but luckily we can reuse lots of them from DataFusion, like unary expressions, binary expressions and aggregators. We only need to implement the PromQL-specific expressions, like `rate` or `percentile`. The following table lists some typical functions in PromQL and their signatures in the proposed implementation; other functions should be the same.
*: *`extrapolate_factor` is one of the "dark sides" in PromQL. In short it's a translation of this [paragraph](https://github.com/prometheus/prometheus/blob/0372e259baf014bbade3134fd79bcdfd8cbdef2c/promql/functions.go#L134-L159)*
To reuse common calculation logic, we can break the functions into several expressions and assemble them in the logical planning phase. For example, `rate()` in PromQL can be represented as `increase / extrapolate_factor`.
## Data Model
This part explains how data is represented. Following the data model in GreptimeDB, all the data is stored as tables, with tag columns, a timestamp column and a value column. Mapping a table to a record batch is very straightforward, so an instant vector can be thought of as a row in the table (though as said before, we don't use instant vectors). Given the four basic types in PromQL (scalar, string, instant vector and range vector), only the last one, "range vector", needs some tricks to adapt to our columnar calculation.
A range vector is some sort of matrix; it consists of small one-dimensional vectors, each being an input of a range function. Applying a range function to a range vector can be thought of as a kind of convolution.
(The left side is an illustration of a range vector. Notice the Y-axis has no meaning; it just puts different pieces separately. The right side is an imagined "matrix" as the range function. Multiplying the left side by it gives a one-dimensional "matrix" with four elements, which is the evaluation result of the range vector.)
To adapt this range vector to a record batch, it should be represented by a column. This RFC proposes to use `DictionaryArray` from Arrow to represent the range vector, or `Matrix`. This is "misusing" `DictionaryArray` to ship some additional information about an array. Because the range vector is sliding over one series, we only need to know the `offset` and `length` of each slide to reconstruct the matrix from an array:

The length is not fixed; it depends on the input's timestamps. A PoC implementation of `Matrix` and `increase()` can be found in [this repo](https://github.com/waynexia/corroding-prometheus).
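To make the idea concrete, here is a hedged sketch (not the PoC's actual types) of how such a `Matrix` could be modelled: a flat value column of one series plus the `(offset, length)` pair of every sliding window, so a range function like the `increase` sketch above can be applied window by window.
```rust
/// Illustrative only: a "range vector" over one time series, reconstructed
/// from a flat value column plus per-window offsets and lengths.
struct Matrix<'a> {
    values: &'a [f64],            // flat column of a single series
    ranges: Vec<(usize, usize)>,  // (offset, length) of every sliding window
}

impl<'a> Matrix<'a> {
    /// Apply a range function such as `increase` to every window.
    fn apply<F>(&self, f: F) -> Vec<Option<f64>>
    where
        F: Fn(&[f64]) -> Option<f64>,
    {
        self.ranges
            .iter()
            .map(|&(offset, len)| f(&self.values[offset..offset + len]))
            .collect()
    }
}
```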
Human beings are always error-prone. Rewriting from the ground up is harder and requires more attention to ensure correctness than translating line by line. And, since the evaluators' architectures are different, it might be painful to catch up with PromQL's breaking updates (if any) in the future.
Misusing Arrow's `DictionaryArray` as `Matrix` is another point. This hack needs some `unsafe` function calls to bypass Arrow's checks. And though Arrow's API is stable, this is still undocumented behavior.
# Alternatives
There are a few alternatives we've considered:
- Wrap the existing PromQL's implementation via FFI, and import it to GreptimeDB.
- Translate its evaluator engine line-by-line, rather than rewrite one.
- Integrate the Prometheus server into GreptimeDB via RPC, making it a detached execution engine for PromQL.
The first and second options make a separate execution engine in GreptimeDB; they may alleviate the pain during rewriting, but will have negative impacts on later evolution, like resource management. And introducing another deployment component, as in the last option, will bring a more complex deployment architecture.
All of them are also more or less redundant in data transportation, which affects performance and resources. The proposed built-in executing procedure is easy to integrate and expose to the existing SQL interface GreptimeDB currently provides. Some concepts in PromQL, like sliding windows (range vectors in PromQL), are very convenient and ergonomic for analyzing series data. This makes it not only a PromQL evaluator, but also an enhancement to our query system.
A framework for executing operations in a fault-tolerant manner.
# Motivation
Some operations in GreptimeDB require multiple steps to implement. For example, creating a table needs:
1. Check whether the table exists
2. Create the table in the table engine
1. Create a region for the table in the storage engine
2. Persist the metadata of the table to the table manifest
3. Add the table to the catalog manager
If the node dies or restarts in the middle of creating a table, it could leave the system in an inconsistent state. The procedure framework, inspired by [Apache HBase's ProcedureV2 framework](https://github.com/apache/hbase/blob/bfc9fc9605de638785435e404430a9408b99a8d0/src/main/asciidoc/_chapters/pv2.adoc) and [Apache Accumulo’s FATE framework](https://accumulo.apache.org/docs/2.x/administration/fate), aims to provide a unified way to implement multi-step operations that is tolerant to failure.
# Details
## Overview
The procedure framework consists of the following primary components:
- A `Procedure` represents an operation or a set of operations to be performed step-by-step
- `ProcedureManager`, the runtime to run `Procedures`. It executes the submitted procedures, stores procedures' states to the `ProcedureStore` and restores procedures from `ProcedureStore` while the database restarts.
- `ProcedureStore` is a storage layer for persisting the procedure state
## Procedures
The `ProcedureManager` keeps calling `Procedure::execute()` until the Procedure is done, so the operation of the Procedure should be [idempotent](https://developer.mozilla.org/en-US/docs/Glossary/Idempotent): it needs to be able to undo or replay a partial execution of itself.
The `Status` is an enum that has the following variants:
```rust
enum Status {
    Executing {
        persist: bool,
    },
    Suspended {
        subprocedures: Vec<ProcedureWithId>,
        persist: bool,
    },
    Done,
}
```
A call to `execute()` can result in the following possibilities:
- `Ok(Status::Done)`: we are done
- `Ok(Status::Executing { .. })`: there are remaining steps to do
- `Ok(Status::Suspended { subprocedures, .. })`: execution is suspended and can be resumed later after the sub-procedures are done.
- `Err(e)`: an error occurs during execution and the procedure is unable to proceed anymore.
Users need to assign a unique `ProcedureId` to the procedure and the procedure can get this id via the `Context`. The `ProcedureId` is typically a UUID.
```rust
struct Context {
    id: ProcedureId,
    // other fields ...
}
```
The `ProcedureManager` calls `Procedure::dump()` to serialize the internal state of the procedure and writes to the `ProcedureStore`. The `Status` has a field `persist` to tell the `ProcedureManager` whether it needs persistence.
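A minimal sketch of how a multi-step procedure could implement this interface follows. The trait shape, the error type, the synchronous signatures and the trimmed-down `Status` (the `Suspended` variant is omitted for brevity) are assumptions for illustration, not the final API; the real trait is likely async and richer.
```rust
// Placeholder error type for the sketch.
type Error = Box<dyn std::error::Error + Send + Sync>;

// Trimmed-down version of the Status enum shown above.
enum Status {
    Executing { persist: bool },
    Done,
}

trait Procedure {
    /// Executes one step. Must be idempotent: the manager may call it again
    /// after a restart, replaying from the last persisted state.
    fn execute(&mut self) -> Result<Status, Error>;

    /// Serializes the internal state so the manager can persist it
    /// to the ProcedureStore.
    fn dump(&self) -> Result<String, Error>;
}

// Toy two-step "create table" procedure, mirroring the steps listed in
// the Motivation section.
enum CreateTableState {
    Prepare,
    CreateRegions,
    Done,
}

struct CreateTableProcedure {
    state: CreateTableState,
}

impl Procedure for CreateTableProcedure {
    fn execute(&mut self) -> Result<Status, Error> {
        match self.state {
            CreateTableState::Prepare => {
                // e.g. check whether the table already exists
                self.state = CreateTableState::CreateRegions;
                Ok(Status::Executing { persist: true })
            }
            CreateTableState::CreateRegions => {
                // e.g. create regions in the table engine, update the catalog
                self.state = CreateTableState::Done;
                Ok(Status::Done)
            }
            CreateTableState::Done => Ok(Status::Done),
        }
    }

    fn dump(&self) -> Result<String, Error> {
        let state = match self.state {
            CreateTableState::Prepare => "prepare",
            CreateTableState::CreateRegions => "create_regions",
            CreateTableState::Done => "done",
        };
        Ok(state.to_string())
    }
}
```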
## Sub-procedures
A procedure may need to create some sub-procedures to process its subtasks. For example, creating a distributed table with multiple regions (partitions) needs to set up the regions in each node, thus the parent procedure should instantiate a sub-procedure for each region. The `ProcedureManager` makes sure that the parent procedure does not proceed till all sub-procedures are successfully finished.
The procedure can submit sub-procedures to the `ProcedureManager` by returning `Status::Suspended`. It needs to assign a procedure id to each procedure manually so it can track the status of the sub-procedures.
```rust
struct ProcedureWithId {
    id: ProcedureId,
    procedure: BoxedProcedure,
}
```
## ProcedureStore
We might need to provide two different ProcedureStore implementations:
- In standalone mode, it stores data on the local disk.
- In distributed mode, it stores data on the meta server or the object store service.
These implementations should share the same storage structure. They store each procedure's state in a unique path based on the procedure id:
```
Sample paths:
/procedures/{PROCEDURE_ID}/000001.step
/procedures/{PROCEDURE_ID}/000002.step
/procedures/{PROCEDURE_ID}/000003.commit
```
`ProcedureStore` behaves like a WAL. Before performing each step, the `ProcedureManager` can write the procedure's current state to the ProcedureStore, which stores the state in the `.step` file. The `000001` in the path is a monotonic increasing sequence of the step. After the procedure is done, the `ProcedureManager` puts a `.commit` file to indicate the procedure is finished (committed).
The `ProcedureManager` can remove the procedure's files once the procedure is done, but it needs to leave the `.commit` as the last file to remove in case of failure during removal.
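As a small illustration of the layout above (the function names are assumptions for the example, not the actual store API), the per-step paths could be generated like this:
```rust
/// Path of the state persisted before executing step `step` of a procedure.
fn step_path(procedure_id: &str, step: u64) -> String {
    format!("/procedures/{procedure_id}/{step:06}.step")
}

/// Path of the marker written once the whole procedure has finished; it must
/// be the last file removed when cleaning up a committed procedure.
fn commit_path(procedure_id: &str, step: u64) -> String {
    format!("/procedures/{procedure_id}/{step:06}.commit")
}
```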
## ProcedureManager
`ProcedureManager` executes procedures submitted to it.
- Register a `ProcedureLoader` by the type name of the `Procedure`.
- Submit a `Procedure` to the manager and execute it.
When `ProcedureManager` starts, it loads procedures from the `ProcedureStore` and restores the procedures by the `ProcedureLoader`. The manager stores the type name from `Procedure::type_name()` with the data from `Procedure::dump()` in the `.step` file and uses the type name to find a `ProcedureLoader` to recover the procedure from its data.
The rollback step is supposed to clean up the resources created during the execute() step. When a procedure has failed, the `ProcedureManager` puts a `rollback` file and calls the `Procedure::rollback()` method.
```text
/procedures/{PROCEDURE_ID}/000001.step
/procedures/{PROCEDURE_ID}/000002.rollback
```
Rollback is complicated to implement, so some procedures might not support rollback or only provide a best-effort approach.
## Locking
The `ProcedureManager` can provide a locking mechanism that gives a procedure read/write access to a database object such as a table so other procedures are unable to modify the same table while the current one is executing.
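A hedged sketch of one way such a lock table could look, purely illustrative (the real manager may key and scope locks differently): a map from table name to an async `RwLock` that procedures acquire before touching the table.
```rust
use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::{Mutex, RwLock};

#[derive(Default)]
struct LockManager {
    // table name -> lock guarding that table
    locks: Mutex<HashMap<String, Arc<RwLock<()>>>>,
}

impl LockManager {
    /// Get (or lazily create) the lock guarding one table. A procedure would
    /// then hold `lock.write().await` for exclusive access, or
    /// `lock.read().await` for shared access, while it runs.
    async fn table_lock(&self, table: &str) -> Arc<RwLock<()>> {
        let mut locks = self.locks.lock().await;
        locks.entry(table.to_string()).or_default().clone()
    }
}
```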
# Drawbacks
The `Procedure` framework introduces additional complexity and overhead to our database.
- To execute a `Procedure`, we need to write to the `ProcedureStore` multiple times, which may slow down the server
- We need to rewrite the logic of creating/dropping/altering a table using the procedure framework
# Alternatives
Another approach is to tolerate failure during execution and allow users to retry the operation until it succeeds. But we still need to:
- Make each step idempotent
- Record the status in some place to check whether we are done
GreptimeDB uses an LSM-tree based storage engine that flushes memtables to SSTs for persistence.
But currently it only supports level 0, and SST files in level 0 are not guaranteed to contain only rows with disjoint time ranges.
That is to say, different SST files in level 0 may contain overlapped timestamps.
The consequence is, in order to retrieve rows in some time range, all files need to be scanned, which brings a lot of IO overhead.
Also, just like other LSM-tree engines, deletes/updates to existing primary keys are converted to new rows with a delete/update mark and appended to SSTs on flushing.
We need to merge the operations on the same primary keys so that we don't have to go through all SST files to find the final state of these primary keys.
## Goal
Implement a compaction framework to:
- maintain SSTs in timestamp order to accelerate queries with timestamp condition;
- merge rows with same primary key;
- purge expired SSTs;
- accommodate other tasks like data rollup/indexing.
## Overview
Table compaction involves the following components:
- Compaction scheduler: run compaction tasks, limit the consumed resources;
- Compaction strategy: find the SSTs to compact and determine the output files of compaction.
- Compaction task: read the rows from input SSTs and write to the output files.
## Implementation
### Compaction scheduler
`CompactionScheduler` is an executor that continuously polls and executes compaction requests from a task queue.
### Compaction strategy
`CompactionStrategy` defines how to pick SSTs in all levels for compaction.
```rust
pub trait CompactionStrategy {
    fn pick(
        &self,
        ctx: CompactionContext,
        levels: &LevelMetas,
    ) -> Result<CompactionTask>;
}
```
The most suitable compaction strategy for the time-series scenario would be
a hybrid strategy that combines time-window compaction with size-tiered compaction, just like [Cassandra](https://cassandra.apache.org/doc/latest/cassandra/operating/compaction/twcs.html) and [ScyllaDB](https://docs.scylladb.com/stable/architecture/compaction/compaction-strategies.html#time-window-compaction-strategy-twcs) do.
We can first group SSTs in level n into buckets according to some predefined time window. Within that window,
SSTs are compacted in a size-tiered manner (find SSTs with similar sizes and compact them to level n+1).
SSTs from different time windows are never compacted together.
That strategy guarantees SSTs in each level are mainly sorted in timestamp order, which boosts queries with
explicit timestamp conditions, while size-tiered compaction minimizes the impact on foreground writes.
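To illustrate the bucketing step described above, here is a hedged sketch of grouping SSTs into fixed-duration time windows before size-tiered picking happens inside each bucket. `FileMeta` and its fields are assumptions for the example, not the engine's actual types.
```rust
use std::collections::BTreeMap;

struct FileMeta {
    file_id: u64,
    /// Time range of the rows in this SST, as unix timestamps in seconds.
    start_ts: i64,
    end_ts: i64,
    size_bytes: u64,
}

/// Assign each SST to the bucket of its end timestamp, aligned down to
/// `window_secs`. SSTs from different buckets are never compacted together;
/// inside a bucket they would then be picked in a size-tiered manner.
fn bucket_by_time_window(
    files: Vec<FileMeta>,
    window_secs: i64,
) -> BTreeMap<i64, Vec<FileMeta>> {
    let mut buckets: BTreeMap<i64, Vec<FileMeta>> = BTreeMap::new();
    for file in files {
        let bucket = file.end_ts - file.end_ts.rem_euclid(window_secs);
        buckets.entry(bucket).or_default().push(file);
    }
    buckets
}
```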
### Alternatives
Currently, GreptimeDB's storage engine [only supports two levels](https://github.com/GreptimeTeam/greptimedb/blob/43aefc5d74dfa73b7819cae77b7eb546d8534a41/src/storage/src/sst.rs#L32).
For level 0, we can start with a simple time-window based leveled compaction, which reads from all SSTs in level 0,
aligns them to time windows with a fixed duration, and merges them with SSTs in level 1 within the same time window
to ensure there is only one sorted run in level 1.
This RFC proposes a method to achieve fault tolerance for regions in GreptimeDB's distributed mode. Or, to put it another way, achieving region high availability ("HA") for the GreptimeDB cluster.
In this RFC, we mainly describe two aspects of region HA: how region unavailability is detected, and what recovery process needs to be taken. We also discuss some alternatives and future work.
When this feature is done, our users can expect a GreptimeDB cluster that can always handle their requests to regions, although some requests may fail during region failover. The optimization to reduce the MTTR (Mean Time To Recovery) is not a concern of this RFC, and is left for future work.
# Motivation
Fault tolerance for regions is a critical feature for our clients to use the GreptimeDB cluster confidently. High availability for users to interact with their stored data is a "must have" for any TSDB product, and that includes our GreptimeDB cluster.
# Details
## Background
Some backgrounds about region in distributed mode:
- A table is logically split into multiple regions. Each region stores a part of non-overlapping table data.
- Regions are distributed among Datanodes; the mappings are not static, and are assigned and governed by Metasrv.
- In distributed mode, client requests are scoped to regions. To be more specific, when a request that needs to scan multiple regions arrives at the Frontend, the Frontend splits the request into multiple sub-requests, each of which scans one region only, and submits them to the Datanodes that hold the corresponding regions.
In conclusion, as long as regions remain available, and can regain availability when failures do occur, overall region HA can be achieved. With this in mind, let's see how region failures are detected first.
## Failure Detection
We detect region failures in Metasrv, both passively and actively. Passive detection means that Metasrv does not fire "are you healthy" requests to regions; instead, region health information is carried in the heartbeat requests that Datanodes submit to Metasrv.
The Datanode already carries its region stats in the heartbeat request (the irrelevant fields are omitted):
```protobuf
message HeartbeatRequest {
  ...
  // Region stats on this node
  repeated RegionStat region_stats = 6;
  ...
}

message RegionStat {
  uint64 region_id = 1;
  TableName table_name = 2;
  ...
}
```
For the sake of simplicity, we don't add another field `bool available = 3` to the `RegionStat` message; instead, if a region is unavailable in the view of the Datanode that contains it, the Datanode simply does not include its `RegionStat` in the heartbeat request. And if the Datanode itself is unavailable, the heartbeat request is not submitted at all, which is effectively the same as not carrying the `RegionStat`.
> The heartbeat interval is now hardcoded to five seconds.
Metasrv gathers the heartbeat requests, extracts the `RegionStat`s, and treats them as region heartbeats. In this way, Metasrv maintains health information for all regions. If a region's heartbeats are not received for a period of time, Metasrv suspects the region might be unavailable. To decide whether a region has failed, Metasrv uses a failure detection algorithm called "[Phi φ Accrual Failure Detection](https://medium.com/@arpitbhayani/phi-%CF%86-accrual-failure-detection-79c21ce53a7a)". Basically, the algorithm calculates a value called "phi" that represents the likelihood of a region's unavailability, based on the arrival rate of historical heartbeats. Once the "phi" rises above a pre-defined threshold, Metasrv regards the region as failed.
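For illustration, a Cassandra-style simplification of the phi computation looks like this (a sketch only; the real detector may fit a normal distribution over the interval history instead):
```rust
use std::collections::VecDeque;

/// Sliding window over recent heartbeat inter-arrival intervals (seconds).
struct PhiDetector {
    intervals: VecDeque<f64>,
    capacity: usize,
}

impl PhiDetector {
    /// Record the interval between two consecutive heartbeats.
    fn record(&mut self, interval: f64) {
        if self.intervals.len() == self.capacity {
            self.intervals.pop_front();
        }
        self.intervals.push_back(interval);
    }

    /// Cassandra-style simplification: assume intervals are exponentially
    /// distributed, so P(no heartbeat within t) = exp(-t / mean), and
    /// phi = -log10(P) = (t / mean) * log10(e).
    fn phi(&self, since_last_heartbeat: f64) -> f64 {
        debug_assert!(!self.intervals.is_empty());
        let mean = self.intervals.iter().sum::<f64>() / self.intervals.len() as f64;
        since_last_heartbeat / mean * std::f64::consts::LOG10_E
    }
}
```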
> This algorithm has been widely adopted in some well-known products, like Akka and Cassandra.
When Metasrv decides from heartbeats that some region has failed, that's not the final verdict. Here comes the "active" detection: before Metasrv starts region failover, it actively invokes the health check interface of the Datanode on which the failed region resides. Only if this health check fails does Metasrv actually start the failover of the region.
To conclude, the failure detection flow is roughly the following (a sketch; the function and constant names are illustrative):
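```rust
// Reuses the `PhiDetector` sketched above. All helpers here are
// hypothetical stand-ins for Metasrv internals.
type RegionId = u64;
type DatanodeId = u64;
const PHI_THRESHOLD: f64 = 8.0; // illustrative value

fn last_heartbeat_time(_region: RegionId) -> f64 { unimplemented!() }
fn datanode_of(_region: RegionId) -> DatanodeId { unimplemented!() }
fn health_check(_node: DatanodeId) -> Result<(), ()> { unimplemented!() }
fn start_region_failover(_region: RegionId) { unimplemented!() }

fn detect_failure(region: RegionId, detector: &PhiDetector, now_secs: f64) {
    // Passive detection: estimate failure likelihood from heartbeats.
    let elapsed = now_secs - last_heartbeat_time(region);
    if detector.phi(elapsed) < PHI_THRESHOLD {
        return; // heartbeats look healthy
    }
    // Active detection: double-check by calling the Datanode directly,
    // to rule out false positives from one-way network connectivity.
    if health_check(datanode_of(region)).is_ok() {
        return;
    }
    start_region_failover(region);
}
```
Two design decisions here deserve explanation: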
- Why detect actively when we already have passive detection? Because the network can sometimes be connectable in only one direction (especially in complex cloud environments): the Datanode's heartbeats cannot reach Metasrv, while Metasrv can still reach the Datanode. Active detection avoids this false-positive situation.
- Why does the detection work on regions instead of Datanodes? Because it is possible that only some of the regions in a Datanode are unavailable, not ALL of them, especially when Datanodes are shared by multiple tenants. If that is the case, it's better to fail over the affected regions only, instead of all regions residing on the Datanode. All in all, we want finer-grained control over region failover.
So we have detected that some regions are unavailable. How do we regain their availability?
## Region Failover
Region failover largely relies on the remote WAL, aka "[Bunshin](https://github.com/GreptimeTeam/bunshin)". I'm not including any of its details in this RFC; let's just assume we already have it.
In general, region failover is fairly simple. Once Metasrv decides to fail over some regions, it first chooses one or more Datanodes to hold the failed regions. This can be done easily, as Metasrv already has the whole picture of the Datanodes: it knows which Datanode has the fewest regions, which Datanode historically had the lowest CPU usage and IO rate, and how Datanodes are assigned to tenants, among other information that can help Metasrv choose the most suitable Datanodes. Let's call these chosen Datanodes "candidates".
> The strategy for choosing the most suitable candidates requires careful design, but that's another RFC.
Then, Metasrv sets the states of these failed regions to "passive". We should add a field to `Region`:
```protobuf
message Region {
  uint64 id = 1;
  string name = 2;
  Partition partition = 3;

  enum State {
    ACTIVE = 0;
    PASSIVE = 1;
  }
  State state = 4;

  map<string, string> attrs = 100;
}
```
Here `Region` is used in the message `RegionRoute`, which indicates how write requests are split among regions. When a region is set to "passive", Frontend knows writes to it should be rejected for the moment (reads of the region are not blocked, however).
> Making a region "passive" here effectively blocks writes to it. That's acceptable in the failover situation, since the region has failed anyway. However, when dealing with active maintenance operations, the region state requires a more refined design. But that's another story.
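For illustration, the Frontend-side gating could look like the following (hypothetical types; in reality the check happens while resolving `RegionRoute`):
```rust
enum RegionState {
    Active,
    Passive,
}

struct RegionInfo {
    id: u64,
    state: RegionState,
}

/// Reject writes to passive regions; reads are left untouched.
fn check_writable(region: &RegionInfo) -> Result<(), String> {
    match region.state {
        RegionState::Active => Ok(()),
        RegionState::Passive => {
            Err(format!("region {} is not writable during failover", region.id))
        }
    }
}
```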
Third, Metasrv fires "close region" requests to the failed Datanodes, and "open region" requests to the candidates. "Close region" requests might fail due to the unavailability of Datanodes, but that's fine: they are just a best-effort attempt to reduce the chance that any in-flight writes get handled unintentionally after the region is set to "passive". The "open region" requests must succeed, though. Datanodes open regions from the remote WAL.
> Currently, "close region" is undefined in Datanode. It could be a local clean-up of cached region data or other resource tidy-up.
Finally, when a candidate successfully opens its region, it calls back to Metasrv, indicating it is ready to serve the region. The "call back" here is backed by its heartbeat to Metasrv. Metasrv then updates the region's state to "active", letting Frontend lift the restriction on region writes (again, reads of the region are untouched).
All the above steps should be managed by a remote procedure framework. That's another implementation challenge in the region failover feature. (One more is the remote WAL, of course.)
Remote WAL raises a problem that could harm the write throughput of a GreptimeDB cluster: each write request has to make at least two remote calls, one from Frontend to Datanode, and one from Datanode to the remote WAL. What if we did it the "[Neon](https://github.com/neondatabase/neon)" way, making the remote WAL sit between Frontend and Datanode; couldn't that improve our write throughput? It could, though there are some consistency issues to solve, like "read-your-writes".
However, the main reasons we don't adopt this method are twofold:
1. The remote WAL is planned to be quorum-based, so it can be written efficiently;
2. More importantly, we plan to make the remote WAL an option that users can choose not to enable (at the cost of some reliability).
## No WAL, Replication instead
This method replicates regions across Datanodes directly, as is common in shared-nothing databases. Were the main region to fail, a standby region in the replica group would be elected as the new "main" and take over the read/write requests. The main concern with this method is its incompatibility with our current architecture and code structure. It requires a major redesign, but gains no significant advantage over the remote WAL method.
However, replication does have advantages of its own that we can learn from to optimize the failover procedure.
# Future Work
Some optimizations we could take:
- To reduce the MTTR, we could have Metasrv choose the candidate for each region ahead of time, during normal operation. The candidate does some preparation work to reduce the region-open time, effectively accelerating the failover procedure.
- We could adopt the replication method to the degree that region replicas are used as fast catch-up candidates. Since the data difference among replicas is minor, region failover would not need to load or exchange much data, greatly reducing the failover time.
User data may already exist in other storage, e.g., file systems or S3, in formats such as CSV, Parquet, or JSON. We can provide users the ability to perform SQL queries on these files.
# Details
## Overview
The file external table provides users the ability to perform SQL queries on such files.
For example, a user has a CSV file on the local file system `/var/data/city.csv`:
```
Rank , Name , State , 2023 Population , 2020 Census , Annual Change , Density (mi²)
1 , New York City , New York , 8,992,908 , 8,804,190 , 0.7% , 29,938
2 , Los Angeles , California , 3,930,586 , 3,898,747 , 0.27% , 8,382
```
The user can create an external table to query the file. The syntax is:
```sql
CREATE EXTERNAL TABLE [IF NOT EXISTS] <table_name>
[
(
    <col_name> <col_type> [NULL | NOT NULL] [COMMENT "<comment>"]
)
]
[ WITH
(
LOCATION = 'url'
[,FIELD_DELIMITER = 'delimiter' ]
[,RECORD_DELIMITER = 'delimiter' ]
[,SKIP_HEADER = '<number>' ]
[,FORMAT = { csv | json | parquet } ]
[,PATTERN = '<regex_pattern>' ]
[,ENDPOINT = '<uri>' ]
[,ACCESS_KEY_ID = '<key_id>' ]
[,SECRET_ACCESS_KEY = '<access_key>' ]
[,SESSION_TOKEN = '<token>' ]
[,REGION = '<region>' ]
[,ENABLE_VIRTUAL_HOST_STYLE = '<boolean>']
..
)
]
```
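For the `city.csv` file above, creating and querying an external table might look like this (a sketch that follows the grammar above; the exact option values are assumptions):
```sql
CREATE EXTERNAL TABLE city
WITH (
    LOCATION = '/var/data/city.csv',
    FORMAT = csv,
    FIELD_DELIMITER = ',',
    SKIP_HEADER = '1'
);

SELECT * FROM city LIMIT 10;
```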
### Supported File Format
The external file table supports multiple formats; we divide them into row formats and columnar formats.
Row formats:
- CSV, JSON
Columnar formats:
- Parquet
Some of these formats support filter pushdown, and others don't. If users query very large files in a format that doesn't support pushdown, scanning the full files may consume a lot of IO and cause a long-running query.
### File Table Engine

We implement a file table engine that creates an external table by accepting user-specified file paths and treating all records as immutable.
1. File Format Decoder: decodes files into `RecordBatch` streams.
2. File Table Engine: implements the `TableProvider` trait, stores necessary metadata in memory, and provides scan ability.
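The shape of these two components can be sketched as follows (trait and type names are illustrative placeholders, not the actual code):
```rust
use std::io;
use std::sync::Arc;

// Placeholder types standing in for the real arrow/datatypes ones.
struct RecordBatch;
struct Schema;
type RecordBatchStream = Box<dyn Iterator<Item = io::Result<RecordBatch>>>;

/// 1. File format decoder: decodes one file into a `RecordBatch` stream.
trait FileFormatDecoder {
    fn decode(&self, location: &str) -> io::Result<RecordBatchStream>;
}

/// 2. File table engine's table: keeps metadata in memory, treats all
/// records as immutable, and scans by delegating to the decoder.
struct FileTable {
    schema: Arc<Schema>,
    locations: Vec<String>,
    decoder: Box<dyn FileFormatDecoder>,
}

impl FileTable {
    /// One stream per file; the query engine merges them.
    fn scan(&self) -> io::Result<Vec<RecordBatchStream>> {
        self.locations
            .iter()
            .map(|loc| self.decoder.decode(loc))
            .collect()
    }
}
```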
Our implementation is better suited for small files. For large files (e.g., a GB-level CSV file), we suggest that users import the data into the database.
## Drawbacks
- Some formats don't support filter pushdown
- Hard to support indexing
## Life cycle
### Register a table
1. Write metadata to manifest.
2. Create the table via file table engine.
3. Register the table to `CatalogProvider` and to `SystemCatalog` (which persists tables to disk).
### Deregister a table (Drop a table)
1. Fetch the target table info (figure out table engine type).
2. Deregister the target table in `CatalogProvider` and `SystemCatalog`.
3. Find the target table engine.
4. Drop the target table.
### Recover a table when restarting
1. Collect table names and engine type info.
2. Find the target tables in different engines.
3. Open and register tables.
# Alternatives
## Using DataFusion API
We can use the DataFusion API to register a file table and query it:
```rust
// Register the CSV file with DataFusion's `SessionContext`, then run a
// SQL query over it (the path here is illustrative).
ctx.register_csv("example", "/var/data/example.csv", CsvReadOptions::new())
    .await?;
let df = ctx
    .sql("SELECT a, MIN(b) FROM example WHERE a <= b GROUP BY a LIMIT 100")
    .await?;
```
### Drawbacks
DataFusion implements its own `ObjectStore` abstraction and supports parsing partitioned directories, which lets it push down filters and skip some directories. However, this makes it impossible to use our `LruCacheLayer` (the parsing of partitioned directories requires paths as input). If we want to fully manage memory, we should implement our own `TableProvider` or `Table`.
- Impossible to use `CacheLayer`
## Introduce an intermediate representation layer

We convert all files into `parquet` as an intermediate representation. Then we only need to implement a `parquet` file table engine, and we already have a similar one. Also, `parquet` supports limited filter pushdown via its row group stats.
Enhance the logical planner to be aware of the distributed, multi-region table topology, to achieve "push computation down" execution rather than the current "pull data up" manner.
# Motivation
Distributed query execution can leverage the advantages of GreptimeDB's architecture to process large datasets that exceed the capacity of a single node, or to accelerate a query by executing it in parallel. This task includes two sub-tasks:
- Being able to transform the plan so that as much computation as possible is pushed down to the data source.
- Being able to handle pipeline breakers (like `Join` or `Sort`) on multiple computation nodes.
This is a relatively complex topic. To keep this RFC concentrated, I'll focus on the first one.
# Details
## Background: Partition and Region
GreptimeDB supports table partitioning, where the partition rule is set during table creation. Each partition can be further divided into one or more physical storage units known as "regions". Both partitions and regions are divided based on rows:
``` text
┌────────────────────────────────────┐
│ │
│ Table │
│ │
└─────┬────────────┬────────────┬────┘
│ │ │
│ │ │
┌─────▼────┐ ┌─────▼────┐ ┌─────▼────┐
│ Region 1 │ │ Region 2 │ │ Region 3 │
└──────────┘ └──────────┘ └──────────┘
Row 1~10 Row 11~20 Row 21~30
```
Generally speaking, the region is the minimum element of data distribution, and we can also use it as the unit for distributing computation. This greatly simplifies the routing logic of the distributed planner: computation is always scheduled to the node that currently opens the corresponding region. It is also easy to scale out compute nodes, since GreptimeDB's data is persisted on a shared storage backend like S3. But this is a bit beyond the scope of this specific topic.
## Background: Commutativity
Commutativity is an attribute that describes whether two operations can exchange their application order: $P_1(P_2(R)) \Leftrightarrow P_2(P_1(R))$. If the equation holds, we can transform one expression into the other form without changing its result. This is useful for rewriting SQL expressions, and it is the theoretical basis of this RFC.
Take this SQL as an example
``` sql
SELECT a FROM t WHERE a > 10;
```
As we know, projection and filter are commutative here ($\pi_a(\sigma_{a>10}(R)) \Leftrightarrow \sigma_{a>10}(\pi_a(R))$, since the filter only references the projected column), so the query can be translated into the following two identical plan trees:
```text
┌─────────────┐ ┌─────────────┐
│Projection(a)│ │Filter(a>10) │
└──────▲──────┘ └──────▲──────┘
│ │
┌──────┴──────┐ ┌──────┴──────┐
│Filter(a>10) │ │Projection(a)│
└──────▲──────┘ └──────▲──────┘
│ │
┌──────┴──────┐ ┌──────┴──────┐
│ TableScan │ │ TableScan │
└─────────────┘ └─────────────┘
```
## Merge Operation
This RFC proposes adding a new plan node, `MergeScan`, to merge results from several regions in the frontend. It wraps the abstraction of remote data and execution, and exposes a `TableScan` interface to the upper level.
``` text
▲
│
┌───────┼───────┐
│ │ │
│ ┌──┴──┐ │
│ └──▲──┘ │
│ │ │
│ ┌──┴──┐ │
│ └──▲──┘ │ ┌─────────────────────────────┐
│ │ │ │ │
│ ┌────┴────┐ │ │ ┌──────────┐ ┌───┐ ┌───┐ │
│ │MergeScan◄──┼────┤ │ Region 1 │ │ │ .. │ │ │
│ └─────────┘ │ │ └──────────┘ └───┘ └───┘ │
│ │ │ │
└─Frontend──────┘ └─Remote-Sources──────────────┘
```
This merge operation simply chains all the underlying remote data sources and returns `RecordBatch`es, just like a coalesce operator. Each remote source is a gRPC query to a datanode via the substrait logical plan interface; the plan it carries is transformed and divided from the original query that arrives at the frontend.
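Conceptually, the merge is just a chain over the per-region streams (a sketch with `Iterator` standing in for the actual `RecordBatch` stream type):
```rust
struct RecordBatch;

/// Each source is one region's result stream, obtained via a gRPC query
/// to a datanode; chaining them behaves like a coalesce operator.
fn merge_scan(
    sources: Vec<Box<dyn Iterator<Item = RecordBatch>>>,
) -> impl Iterator<Item = RecordBatch> {
    sources.into_iter().flatten()
}
```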
## Commutativity of MergeScan
Obviously, the position of `MergeScan` is the key to the distributed plan. The closer it is to the underlying `TableScan`, the less computation is taken over by datanodes. Thus the goal is to pull the `MergeScan` up as far as possible. "Pull up" here means exchanging `MergeScan` with its parent node in the plan tree, so we should check the commutativity between the existing plan nodes and the `MergeScan`. Here I classify all the possibilities into five categories.
After establishing the set of commutative relations for all expressions, we can begin transforming the logical plan. There are four steps:
- Add a merge node before table scan
- Evaluate commutativity in a bottom-up way, stop at the first non-commutative node
- Divide the TableScan to scan over partitions
- Execute
First, insert the `MergeScan` on top of the bottom `TableScan` node. Then examine commutativity starting from the `MergeScan` node and transform the plan tree based on the result, stopping at the first non-commutative node:
``` text
┌─────────────┐ ┌─────────────┐
│ Sort │ │ Sort │
└──────▲──────┘ └──────▲──────┘
│ │
┌─────────────┐ ┌──────┴──────┐ ┌──────┴──────┐
│ Sort │ │Projection(a)│ │ MergeScan │
└──────▲──────┘ └──────▲──────┘ └──────▲──────┘
│ │ │
┌──────┴──────┐ ┌──────┴──────┐ ┌──────┴──────┐
│Projection(a)│ │ MergeScan │ │Projection(a)│
└──────▲──────┘ └──────▲──────┘ └──────▲──────┘
│ │ │
┌──────┴──────┐ ┌──────┴──────┐ ┌──────┴──────┐
│ TableScan │ │ TableScan │ │ TableScan │
└─────────────┘ └─────────────┘ └─────────────┘
(a) (b) (c)
```
Then, in the physical planning phase, the sub-tree below `MergeScan` is converted into a remote query request and dispatched to all the regions, while the `MergeScan` node receives the results and feeds them to its parent node.
To keep the overall complexity down, any error in this procedure fails the entire query and cancels all its other in-flight parts.
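The pull-up pass itself can be sketched as a recursive rewrite (a simplified single-child plan tree; the real implementation would be an optimizer rule over the actual plan representation):
```rust
/// Simplified plan: each node has at most one child.
enum Plan {
    Sort(Box<Plan>),
    Projection(Box<Plan>),
    Filter(Box<Plan>),
    MergeScan(Box<Plan>),
    TableScan,
}

/// Pull MergeScan upward: first rewrite the child, then, if this node
/// commutes with a MergeScan child, exchange the two. Non-commutative
/// nodes (like Sort) stop the pull-up.
fn pull_up(plan: Plan) -> Plan {
    match plan {
        Plan::Projection(child) => exchange(Plan::Projection, pull_up(*child)),
        Plan::Filter(child) => exchange(Plan::Filter, pull_up(*child)),
        Plan::Sort(child) => Plan::Sort(Box::new(pull_up(*child))),
        other => other,
    }
}

/// parent(MergeScan(x)) => MergeScan(parent(x))
fn exchange(parent: fn(Box<Plan>) -> Plan, child: Plan) -> Plan {
    match child {
        Plan::MergeScan(inner) => Plan::MergeScan(Box::new(parent(inner))),
        other => parent(Box::new(other)),
    }
}
```
Running this on plan (a) above yields plan (c): `MergeScan` passes `Projection` but stops below `Sort`.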
# Alternatives
## Spill
If we only consider the ability to process large datasets, we can enable DataFusion's spilling to temporarily persist intermediate data to disk, like "swap" memory. But this leads to very slow performance and very large write amplification.
# Future Work
As described in the `Motivation` section, we can further explore the distributed planner at the physical execution level, by introducing mechanisms like Spark's shuffle to improve parallelism and reduce the stages caused by intermediate pipeline breakers.
The `datatypes` crate defines the elementary schema structs that describe the metadata.
## ColumnSchema
[ColumnSchema](https://github.com/GreptimeTeam/greptimedb/blob/9fa871a3fad07f583dc1863a509414da393747f8/src/datatypes/src/schema/column_schema.rs#L36) represents the metadata of a column. It is equivalent to arrow's [Field](https://docs.rs/arrow/latest/arrow/datatypes/struct.Field.html) with additional metadata such as default constraint and whether the column is a time index. The time index is the column with a `TIME INDEX` constraint of a table. We can convert the `ColumnSchema` into an arrow `Field` and convert the `Field` back to the `ColumnSchema` without losing metadata.
[Schema](https://github.com/GreptimeTeam/greptimedb/blob/9fa871a3fad07f583dc1863a509414da393747f8/src/datatypes/src/schema.rs#L38) is an ordered sequence of `ColumnSchema`. It is equivalent to arrow's [Schema](https://docs.rs/arrow/latest/arrow/datatypes/struct.Schema.html) with additional metadata including the index of the time index column and the version of this schema. Same as `ColumnSchema`, we can convert our `Schema` from/to arrow's `Schema`.
```rust
use arrow::datatypes::Schema as ArrowSchema;

pub struct Schema {
    column_schemas: Vec<ColumnSchema>,
    name_to_index: HashMap<String, usize>,
    arrow_schema: Arc<ArrowSchema>,
    timestamp_index: Option<usize>,
    version: u32,
}

pub type SchemaRef = Arc<Schema>;
```
We alias `Arc<Schema>` as `SchemaRef` since it is used frequently. Mostly, we use our `ColumnSchema` and `Schema` structs instead of Arrow's `Field` and `Schema` unless we need to invoke third-party libraries (like DataFusion or ArrowFlight) that rely on Arrow.
## RawSchema
`Schema` contains fields like the map from column names to their indices in the `ColumnSchema` sequence and a cached arrow `Schema`. We can reconstruct these fields from the `ColumnSchema` sequence, so we don't want to serialize them. This is why we don't derive `Serialize` and `Deserialize` for `Schema`. Instead, we introduce a new struct [RawSchema](https://github.com/GreptimeTeam/greptimedb/blob/9fa871a3fad07f583dc1863a509414da393747f8/src/datatypes/src/schema/raw.rs#L24) which keeps all required fields of a `Schema` and derives the serialization traits. To serialize a `Schema`, we convert it into a `RawSchema` first and serialize the `RawSchema`.
```rust
pub struct RawSchema {
    pub column_schemas: Vec<ColumnSchema>,
    pub timestamp_index: Option<usize>,
    pub version: u32,
}
```
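For illustration, serialization then goes through `RawSchema` roughly like this (a sketch; the module path and the `From`/`TryFrom` conversions are assumptions):
```rust
use datatypes::schema::{RawSchema, Schema}; // assumed path

fn serialize(schema: &Schema) -> serde_json::Result<String> {
    // Schema -> RawSchema -> JSON; only the required fields are written.
    let raw = RawSchema::from(schema);
    serde_json::to_string(&raw)
}

fn deserialize(json: &str) -> Schema {
    // JSON -> RawSchema -> Schema; name_to_index and the cached arrow
    // schema are rebuilt from the column schemas during the conversion.
    let raw: RawSchema = serde_json::from_str(json).unwrap();
    Schema::try_from(raw).unwrap()
}
```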
We want to keep the `Schema` simple and avoid putting too much business-related metadata in it as many different structs or traits rely on it.
# Schema of the Table
A table maintains its schema in [TableMeta](https://github.com/GreptimeTeam/greptimedb/blob/9fa871a3fad07f583dc1863a509414da393747f8/src/table/src/metadata.rs#L97).
```rust
pub struct TableMeta {
    pub schema: SchemaRef,
    pub primary_key_indices: Vec<usize>,
    pub value_indices: Vec<usize>,
    // ...
}
```
The order of columns in `TableMeta::schema` is the same as the order specified in the `CREATE TABLE` statement which users use to create this table.
The field `primary_key_indices` stores the indices of the primary key columns. The field `value_indices` records the indices of the value columns (the non-primary-key columns, including the time index; we sometimes call them field columns).
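A worked example (a sketch; suppose the `cpu` table used below was created with the column order `(ts, host, usage_user, usage_system, datacenter)` and `PRIMARY KEY(datacenter, host)`):
```rust
// TableMeta::schema follows the CREATE TABLE order:
//   [ts, host, usage_user, usage_system, datacenter]
let primary_key_indices = vec![4, 1]; // datacenter, host (PRIMARY KEY order)
let value_indices = vec![0, 2, 3];    // ts, usage_user, usage_system
```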
We split a table into one or more units with the same schema and then store these units in the storage engine. Each unit is a region in the storage engine.
The storage engine maintains schemas of regions in more complicated ways because it
- adds internal columns that are invisible to users to store additional metadata for each row
- provides a data model similar to the key-value model so it organizes columns in a different order
- maintains additional metadata like column id or column family
So the storage engine defines several schema structs:
- RegionSchema
- StoreSchema
- ProjectedSchema
## RegionSchema
A [RegionSchema](https://github.com/GreptimeTeam/greptimedb/blob/9fa871a3fad07f583dc1863a509414da393747f8/src/storage/src/schema/region.rs#L37) describes the schema of a region.
```rust
pub struct RegionSchema {
    user_schema: SchemaRef,
    store_schema: StoreSchemaRef,
    columns: ColumnsMetadataRef,
}
```
Each region reserves some columns called `internal columns` for internal usage:
- `__sequence`, the sequence number of a row
- `__op_type`, the operation type of a row, such as `PUT` or `DELETE`
- `__version`, the user-specified version of a row, reserved but not used. We might remove this in the future.
The table engine can't see the `__sequence` and `__op_type` columns, so the `RegionSchema` itself maintains two internal schemas:
- User schema, a `Schema` struct that doesn't have internal columns
- Store schema, a `StoreSchema` struct that has internal columns
The `ColumnsMetadata` struct keeps metadata about all columns, but most of the time we only need the metadata in the user schema and the store schema, so we just ignore it here. We may remove this struct in the future.
`RegionSchema` organizes columns in the following order:
```
key columns, timestamp, [__version,] value columns, __sequence, __op_type
```
We can ignore the `__version` column because it is disabled now:
```
key columns, timestamp, value columns, __sequence, __op_type
```
Key columns are columns of a table's primary key. Timestamp is the time index column. A region sorts all rows by key columns, timestamp, sequence, and op type.
So the `RegionSchema` of our `cpu` table above looks like this:
```json
{
"user_schema":[
"datacenter",
"host",
"ts",
"usage_user",
"usage_system"
],
"store_schema":[
"datacenter",
"host",
"ts",
"usage_user",
"usage_system",
"__sequence",
"__op_type"
]
}
```
## StoreSchema
As described above, a [StoreSchema](https://github.com/GreptimeTeam/greptimedb/blob/9fa871a3fad07f583dc1863a509414da393747f8/src/storage/src/schema/store.rs#L36) is a schema that knows all internal columns.
```rust
struct StoreSchema {
    columns: Vec<ColumnMetadata>,
    schema: SchemaRef,
    row_key_end: usize,
    user_column_end: usize,
}
```
The columns in the `columns` and `schema` fields have the same order. `ColumnMetadata` holds metadata like the column id, column family id, and comment. The `StoreSchema` also stores this metadata in `StoreSchema::schema`, so we can convert a `StoreSchema` to and from arrow's `Schema`. We use this feature to persist the `StoreSchema` in SSTs, since our SST format is `Parquet`, which can take arrow's `Schema` as its schema.
The `StoreSchema` of the region above is similar to this:
```json
{
"schema":{
"column_schemas":[
"datacenter",
"host",
"ts",
"usage_user",
"usage_system",
"__sequence",
"__op_type"
],
"time_index":2,
"version":0
},
"row_key_end":3,
"user_column_end":5
}
```
The key and timestamp columns form the row keys of rows. We put them together so we can use `row_key_end` to get the indices of all row key columns. Similarly, we can use `user_column_end` to get the indices of all user columns (non-internal columns).
Another useful feature of `StoreSchema` is that it is guaranteed to always contain the key columns, the timestamp column, and the internal columns, because we need them to perform merge, deduplication, and deletion. Projection on a `StoreSchema` only projects the value columns.
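In other words, the two offsets slice the ordered column list (a sketch; the actual helper methods may be named differently):
```rust
impl StoreSchema {
    /// Row key columns (key columns + timestamp): columns[..row_key_end].
    fn row_key_columns(&self) -> &[ColumnMetadata] {
        &self.columns[..self.row_key_end]
    }

    /// User-visible columns: columns[..user_column_end]; the remainder,
    /// columns[user_column_end..], are the internal columns.
    fn user_columns(&self) -> &[ColumnMetadata] {
        &self.columns[..self.user_column_end]
    }
}
```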
## ProjectedSchema
To support arbitrary projection, we introduce the [ProjectedSchema](https://github.com/GreptimeTeam/greptimedb/blob/9fa871a3fad07f583dc1863a509414da393747f8/src/storage/src/schema/projected.rs#L106).
```rust
pub struct ProjectedSchema {
    projection: Option<Projection>,
    schema_to_read: StoreSchemaRef,
    projected_user_schema: SchemaRef,
}
```
We need to handle many cases while doing projection:
- The columns' order of table and region is different
- The projection can be in arbitrary order, e.g. `select usage_user, host from cpu` and `select host, usage_user from cpu` have different projection order
- We support `ALTER TABLE` so data files may have different schemas.
### Projection
Let's take an example to see how projection works. Suppose we want to select `ts`, `usage_system` from the `cpu` table.
The query engine uses the projection `[0, 3]` to scan the table. However, columns in the region have a different order, so the table engine adjusts the projection to `[2, 4]`.
```json
{
"user_schema":[
"datacenter",
"host",
"ts",
"usage_user",
"usage_system"
]
}
```
As you can see, the output order is still `[ts, usage_system]`. This is the schema users can see after projection so we call it `projected user schema`.
But the storage engine also needs to read key columns, a timestamp column, and internal columns. So we maintain a `StoreSchema` after projection in the `ProjectedSchema`.
The `Projection` struct is a helper struct to help compute the projected user schema and store schema.
So we can construct the following `ProjectedSchema`:
```json
{
"schema_to_read":{
"schema":{
"column_schemas":[
"datacenter",
"host",
"ts",
"usage_system",
"__sequence",
"__op_type"
],
"time_index":2,
"version":0
},
"row_key_end":3,
"user_column_end":4
},
"projected_user_schema":{
"column_schemas":[
"ts",
"usage_system"
],
"time_index":0
}
}
```
As you can see, `schema_to_read` doesn't contain the column `usage_user` that is not intended to be read (not in projection).
### ReadAdapter
As mentioned above, we can alter a table so the underlying files (SSTs) and memtables in the storage engine may have different schemas.
To simplify the logic of `ProjectedSchema`, we handle the difference between schemas before projection (constructing the `ProjectedSchema`). We introduce [ReadAdapter](https://github.com/GreptimeTeam/greptimedb/blob/9fa871a3fad07f583dc1863a509414da393747f8/src/storage/src/schema/compat.rs#L90) that adapts rows with different source schemas to the same expected schema.
So we can always use the current `RegionSchema` of the region to construct the `ProjectedSchema`, and then create a `ReadAdapter` for each memtable or SST.
```rust
#[derive(Debug)]
pub struct ReadAdapter {
    source_schema: StoreSchemaRef,
    dest_schema: ProjectedSchemaRef,
    indices_in_result: Vec<Option<usize>>,
    is_source_needed: Vec<bool>,
}
```
For each column required by `dest_schema`, `indices_in_result` stores the index of that column in the row read from the source memtable or SST. If the source row doesn't contain that column, the index is `None`.
The field `is_source_needed` stores whether a column in the source memtable or SST is needed.
Suppose we add a new column `usage_idle` to the table `cpu`.
```sql
ALTER TABLE cpu ADD COLUMN usage_idle DOUBLE;
```
The new `StoreSchema` becomes:
```json
{
"schema":{
"column_schemas":[
"datacenter",
"host",
"ts",
"usage_user",
"usage_system",
"usage_idle",
"__sequence",
"__op_type"
],
"time_index":2,
"version":1
},
"row_key_end":3,
"user_column_end":6
}
```
Note that we bump the version of the schema to 1.
Suppose we want to select `ts`, `usage_system`, and `usage_idle`. While reading from files written with the old schema, the storage engine creates a `ReadAdapter` like this:
```json
{
"source_schema":{
"schema":{
"column_schemas":[
"datacenter",
"host",
"ts",
"usage_user",
"usage_system",
"__sequence",
"__op_type"
],
"time_index":2,
"version":0
},
"row_key_end":3,
"user_column_end":5
},
"dest_schema":{
"schema_to_read":{
"schema":{
"column_schemas":[
"datacenter",
"host",
"ts",
"usage_system",
"usage_idle",
"__sequence",
"__op_type"
],
"time_index":2,
"version":1
},
"row_key_end":3,
"user_column_end":5
},
"projected_user_schema":{
"column_schemas":[
"ts",
"usage_system",
"usage_idle"
],
"time_index":0
}
},
"indices_in_result":[
0,
1,
2,
3,
null,
4,
5
],
"is_source_needed":[
true,
true,
true,
false,
true,
true,
true
]
}
```
We don't need to read `usage_user`, so `is_source_needed[3]` is `false`. The old schema doesn't have the column `usage_idle`, so `indices_in_result[4]` is `null`, and the `ReadAdapter` needs to insert a null column into the output row so that the output schema still contains `usage_idle`.
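The adaptation itself can be sketched as follows (simplified to rows of optional values; the real code operates on batches of columns):
```rust
type Value = Option<f64>; // stand-in for the real value type

/// `source_row` contains only the needed source columns (those with
/// `is_source_needed == true`). For each destination column, pick its
/// value by index, or fill in a null if the source doesn't have it.
fn adapt_row(indices_in_result: &[Option<usize>], source_row: &[Value]) -> Vec<Value> {
    indices_in_result
        .iter()
        .map(|idx| match idx {
            Some(i) => source_row[*i],
            None => None, // column missing in source: fill with null
        })
        .collect()
}
```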
The figure below shows the relationship between `RegionSchema`, `StoreSchema`, `ProjectedSchema`, and `ReadAdapter`.
```text
┌──────────────────────────────┐
│ │
│ ┌────────────────────┐ │
│ │ store_schema │ │
│ │ │ │
│ │ StoreSchema │ │
│ │ version 1 │ │
│ └────────────────────┘ │
│ │
│ ┌────────────────────┐ │
│ │ user_schema │ │
│ └────────────────────┘ │
│ │
│ RegionSchema │
│ │
└──────────────┬───────────────┘
│
│
│
┌──────────────▼───────────────┐
│ │
│ ┌──────────────────────────┐ │
│ │ schema_to_read │ │
│ │ │ │
│ │ StoreSchema (projected) │ │
│ │ version 1 │ │
│ └──────────────────────────┘ │
┌───┤ ├───┐
│ │ ┌──────────────────────────┐ │ │
│ │ │ projected_user_schema │ │ │
│ │ └──────────────────────────┘ │ │
│ │ │ │
│ │ ProjectedSchema │ │
 dest schema │  └──────────────────────────────┘  │ dest schema
             │                                    │
┌────────────▼───────────────┐      ┌─────────────▼──────────────┐
│ ReadAdapter                │      │ ReadAdapter                │
│ source: StoreSchema v0     │      │ source: StoreSchema v1     │
│ (e.g. an old SST)          │      │ (e.g. a memtable/new SST)  │
└────────────────────────────┘      └────────────────────────────┘
```