* chore: add logs and metrics
* feat: add the timer to track heartbeat interval
* feat: add the gauge to track region leases
* refactor: use gauge instead of the timer
* chore: apply suggestions from CR
* feat: add hit rate and etcd txn metrics
* feat: add random weighted choice in load_based selector
* fix: meta cannot save heartbeats when the cluster has no regions
* chore: print some log
* chore: remove unused code
* cr
* add some logs when filter result is empty
* feat: add page cache
* docs: update mito config toml
* feat: impl CachedPageReader
* feat: use cache reader to read row group
* feat: do not fetch data if we have pages in cache
* chore: return if nothing to fetch
* feat: enlarge page cache size
* test: test write read parquet
* test: test cache
* docs: update comments
* test: fix config api test
* feat: cache metrics
* feat: change default page cache size
* test: fix config api test
* feat: bump prost and fix pprof feature compiler errors
* feat: fix compiler errors on tokio-console
* chore: fix compiler errors
* ci: add all features check to ci
* feat: decrease the `page size` if the response message size exceeds the limit
* chore: apply suggestions from CR
* feat: prefer to use adaptive_page_size
* chore: apply suggestions from CR
* feat: Control merge reader by batch size
* test: test heap with a large range
* fix: merge one batch
* test: merge many duplicates
* test: test reheap hot
* feat: don't handle empty batch in merge reader
* feat: use fixed error message for unknown error
* feat: return fixed message for internal error as well
* chore: include status code in error message
* test: update tests for asserts of error message
* feat: change status code of some datafusion error
* fix: make CollectRecordbatch a query error
* test: update sqlness results
* feat: RepeatedTask adds execute-first-wait-later behavior.
* feat: add interval generator for repeated task component
* feat: impl debug for dyn IntervalGenerator trait
* chore: change some words
* chore: instead of a more complicated mechanism, add an initial_delay to control the task interval
* chore: some improvements per PR comments
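A minimal sketch of the interval-generator idea behind these commits (names are illustrative, not the actual common-runtime API): the first tick can be shorter than the steady-state interval, which is what the initial_delay achieves.

```rust
use std::fmt::Debug;
use std::time::Duration;

/// Yields the wait time before each execution of a repeated task.
trait IntervalGenerator: Send + Sync + Debug {
    fn next(&mut self) -> Duration;
}

/// A fixed interval with an optional, usually shorter, initial delay so the
/// task runs soon after startup and then settles into a steady cadence.
#[derive(Debug)]
struct FixedInterval {
    initial_delay: Option<Duration>,
    interval: Duration,
}

impl IntervalGenerator for FixedInterval {
    fn next(&mut self) -> Duration {
        // The first call consumes the initial delay; later calls return the interval.
        self.initial_delay.take().unwrap_or(self.interval)
    }
}
```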
* feat: opentsdb row protocol
* fix: added comments for number of rows and failure if output is not of affected rows
* fix: added extra 1 to number of columns
* fix: avoided cloning datapoints, took ownership instead
* fix: avoided cloning datapoints, took ownership instead
* fix: changed vector slice to vector
* fix: remove clone
* fix: combined datapoints and requests with zip instead of enumerating
---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-43-183.us-east-2.compute.internal>
* feat: enable range expr nesting
* fix: change range expr rewrite format
* chore: organize range query tests
* chore: change range expr name(e.g. MAX(v) RANGE 5s FILL 6)
* chore: add range query test
* chore: fix code advice
* chore: fix ca
* ci: add copy-image.sh and upload-artifacts-to-s3.sh
* ci: remove unused options in dev build
* ci: use 'upload-artifacts-to-s3.sh' and 'copy-image.sh' in release-cn-artifacts action
* refactor: refine copy-image.sh
* fix: invalid requests created by nyc-taxi
* feat: add timestamp to table name
* style: fix clippy
* chore: re-export deps for client
* fix: wait result
* chore: no need to define a prefix constant
* feat: memtable support filter pushdown to prune primary keys
* fix: switch to next time series when pk not selected
* fix: allow predicate evaluation failure
* fix: some clippy warnings
* fix: panic when no primary key in schema
* feat: cache decoded record batch for primary key
* refactor: use arcswap instead of rwlock
* fix: format toml
* test: test different order
* test: add tests for missing and invalid columns
* fix: do not skip schema validation while missing columns
* chore: use field_columns()
* test: add tests for different column order
* chore: add dbname in region request header for tracking purpose
* chore: fix handle read
* chore: add write meter
* chore: add meter-core to dep
* chore: add converter between RegionRequestHeader and QueryContext & update proto version
* feat: add vector_cache to CacheManager
* feat: cache repeated vectors
* feat: skip decoding pk if output doesn't contain tags
* test: add TestRegionMetadataBuilder
* test: test ProjectionMapper
* test: test vector cache
* test: test projection mapper convert
* style: fix clippy
* feat: do not cache vector if it is too large
* docs: update comment
* feat: merge by heap
* fix: fix heap order
* feat: avoid pop/push next and refactor some functions
* feat: replace merge_batches and fix tests
* test: add test that a key is deleted
* fix: skip empty batch
* style: clippy
* chore: fix typos
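The merge-by-heap commits above boil down to a k-way merge; a simplified sketch of the pop-then-refill loop, using plain integers instead of the real `Batch` type:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge several ascending sequences into one ascending stream.
fn merge_sorted(sources: Vec<Vec<i64>>) -> Vec<i64> {
    let mut heap = BinaryHeap::new();
    let mut iters: Vec<_> = sources.into_iter().map(|v| v.into_iter()).collect();

    // Seed the heap with the head of every non-empty source.
    for (idx, it) in iters.iter_mut().enumerate() {
        if let Some(v) = it.next() {
            heap.push(Reverse((v, idx)));
        }
    }

    let mut out = Vec::new();
    while let Some(Reverse((v, idx))) = heap.pop() {
        out.push(v);
        // Refill from the source we just consumed so the heap stays a min-heap.
        if let Some(next) = iters[idx].next() {
            heap.push(Reverse((next, idx)));
        }
    }
    out
}
```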
* feat: support greatest function
* feat: make greatest take date_type as input
* fix: move sqlness test into common/function/time.sql
* fix: avoid using unwrap
* fix: use downcast
* refactor: simplify arrow cast
* feat: implement new histogram data model
* feat: use prometheus table format for histogram
* refactor: remove duplicated code
* fix: histogram tag column
* fix: use accumulated count in buckets
* refactor: using row based protocol for otlp WIP
* refactor: use row based writer for otlp.
Also updated row writer for owned keys
* refactor: use row writers for otlp
* test: add integration tests for histogram
* refactor: change le column name
* test: test on_compaction_finished
* fix: avoid submitting the same region to compact
* feat: persist and recover compaction time window
* test: fix test
* test: sort like result
* feat: added show tables command
* fix(tests): fixed parser and statement unit tests
* chore: implemented Display trait for table type
* fix: handled missing table type and error for unsupported command in show database
* chore: removed full as a show kind, instead as a show option
* chore(tests): fixed failing test and added more tests for show full
* chore: refactored table types to use filters
* fix: changed table_type to tables
* feat: RegionMetadataBuilder allow adding/dropping columns multiple times
* test: test add if not exists/drop if exists
* feat: change validator and add need_alter
* test: fix tests and test need_alter
* test: test alter retry
* feat: open before create
* style: fix clippy
* refactor: move RegionOptions to options mod
* refactor: define compaction strategy in region/options.rs
* feat: use duration for time window
* refactor: rename CompactionStrategy to CompactionOptions
* feat: use serde to parse options
* feat: parse options
* feat: set options on creation/opening
* test: test create/open with options
* chore: remove todo
* feat: get compaction ttl and options from RegionOptions
* style: fix clippy
* chore: Remove unused engine_options
* style: fix clippy
* chore: remove todo
* fix: check version before alter region
* chore: apply suggestions from CR
* Update src/mito2/src/worker/handle_alter.rs
Co-authored-by: dennis zhuang <killme2008@gmail.com>
---------
Co-authored-by: dennis zhuang <killme2008@gmail.com>
* feat: allow multiple waiters in compaction request
* feat: compaction status wip
* feat: track region status in compaction scheduler
* feat: impl compaction scheduler
* feat: call compaction scheduler
* feat: remove status if nothing to compact
* feat: schedule compaction after flush
* feat: set compacting to false after compaction finished
* refactor: flush status only needs region id and version control
* refactor: schedule_compaction don't need region as argument
* test: test flush/scheduler for empty requests
* test: trigger compaction in test
* feat: notify scheduler on truncated
* chore: Apply suggestions from code review
Co-authored-by: JeremyHi <jiachun_feng@proton.me>
---------
Co-authored-by: JeremyHi <jiachun_feng@proton.me>
* refactor:
1. remove TableIdent, use TableId directly
2. use the latest greptime-proto
3. independently invalidate table id cache and table name cache
* rebase
* fix: resolve PR comments
* fix: resolve PR comments
* chore: set mutable limit to half of the global write buffer size
* refactor: put handle_flush_finished after handle_flush_request
* refactor: rename tests.rs to basic_test.rs
* style: fmt code
* feat: add writable flag to region.
* refactor: rename MitoEngine to MitoEngine::scanner
* feat: add set_writable() to RegionEngine
* feat: check whether region is writable
* feat: make set_writable sync
* test: test set_writable
* docs: update comments
* feat: send result on compaction failure
* refactor: wrap output sender in new type
* feat: on failure
* refactor: use get_region_or/writable_region_or
* refactor: remove send_result
* feat: notify waiters on flush scheduler drop
* test: fix tests
* fix: only alter writable region
* feat: add more info to error messages
* feat: store next column id in procedure
* fix: update next column id for table info
* test: fix add col test
* chore: remove location from invalid request error
* test: update test
* test: fix test
* test: add test for reopen
* feat: last entry id starts from flushed entry id
* fix: store flushed sequence and recover it from manifest
* test: check sequence in alter test
* test: more tests for alter
* feat: impl handle_alter wip
* refactor: move send_result to worker.rs
* feat: skeleton for handle_alter_request
* feat: write requests should wait for alteration
* feat: define alter request
* chore: no warnings
* fix: remove memtables after flush
* chore: update comments and impl add_write_request_to_pending
* feat: add schema version to RegionMetadata
* feat: impl alter_schema/can_alter_directly
* chore: use send_result
* test: pull next_batch again
* feat: convert pb AlterRequest to RegionAlterRequest
* feat: validate alter request
* feat: validate request and alter metadata
* feat: allow none location
* test: test alter
* fix: recover files and flushed entry id from manifest
* test: test alter
* chore: change comments and variables
* chore: fix compiler errors
* feat: add is_empty() to MemtableVersion
* test: fix metadata alter test
* fix: Compaction picker doesn't notify waiters if it returns None
* chore: address CR comments
* test: add tests for alter request
* refactor: use send_result
* feat: compaction component
* feat: mito2 compaction
* Avoid building time range predicates when merge SST files since in TWCS we don't enforce strict time window.
* fix: some CR comments
* minor: change CompactionRequest::senders to an option
* chore: handle compaction finish error
* feat: integrate compaction into region worker
* chore: rebase upstream
* fix: Some CR comments
* chore: Apply suggestions from code review
* style: fix clippy
---------
Co-authored-by: Yingwen <realevenyag@gmail.com>
* refactor:
1. remove method `register_system_table` from CatalogManager
2. the creation of ScriptTable (as a system table) is removed from CatalogManager. Instead, the ScriptTable is created when the Frontend instance starts, by calling the Frontend instance's gRPC handler.
* rebase
* fix: filter out outdated heartbeat, #1707
* feat: reorder handlers
* refactor: disableXXX to enableXXX
* feat: make full use of region leases to facilitate failover
* chore: minor refactor
* chore: by comment
* feat: logging on inactive/active
* chore: call handle_flush_request
* feat: alias SchedulerRef and clean scheduler on drop
* feat: add scheduler to workers
* feat: remove RegionMemtableStats
* feat: pick regions to flush
* feat: add more fields to region flush task
* feat: smallvec workspace dep
* feat: Use list to hold immutable memtables
* feat: flush job wip
* feat: use access layer to read write sst
* feat: flush memtables to l0
* feat: write manifest
* feat: schedule next flush on success
* feat: schedule flush on success and failure
* feat: add purger to region
* feat: apply edit after flush
* feat: collect stats for SSTs
* feat: manual flush
* test: test flush and fix manifest test
* feat: remove flush scheduler job limit
* fix: typo
* style: clippy
* feat: clean flushed files on failure
* chore: address CR comment
* refactor: Use put_rows
* feat: Clean flush scheduler on drop
* feat: remove region flush status on drop and close
* chore: address CR comment
* feat: alias SchedulerRef and clean scheduler on drop
* feat: add scheduler to workers
* feat: use access layer to read write sst
* feat: add purger to region
* refactor: allow getting region_dir from AccessLayer
* feat: add scheduler to FlushScheduler
* feat: getter for object store
* chore: fix typo
Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
---------
Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
* feat: only allow timestamp data type as time index
* test: update sqltest cases, todo: need some fixes
* fix: sqlness tests
* fix: forgot adding back cte test
* chore: style
* fix: existing null value for schema name value
* chore: fix null check
* fix: change CatalogNameValue and SchemaNameValue to Option
* fix: fix null case
* chore: update proto
* chore: add try from for schema name value
* chore: merge schema opts to table opts while creating table
* chore: use table ttl opts first
* chore: add unit test
* chore: update proto version
* feat: replay memtable when opening table
* test: region replay
* refactor: save logstore in TestEnv
* fix: some cr comments
* chore: rebase develop
* chore: update last entry id during replay
* chore: change userinfo to query_ctx in http handler
* chore: minor change
* chore: move prometheus http to http mod
* chore: fix unit test
* chore: add back schema check
* chore: minor change
* chore: remove clone
* refactor: use arrow::compute::concat instead of push values to vector builders
* feat: support projection
* refactor: remove sequence
* refactor: concatenate
* fix: series must not be empty
* refactor: projection
* feat: timestamp types sqlness tests
* feat: adds timestamp tests
* test: add string tests
* test: comment a case in timestamp
* test: add float type tests
* chore: adds TODO
* feat: set TZ=UTC for sqlness test
* feat: Implement slice and first/last timestamp for Batch
* feat(mito): implements sort/concat for Batch
* chore: fix typo
* chore: remove comments
* feat: sort and dedup
* test: test batch operations
* chore: cast enum to test op type
* test: test filter related api
* style: fix clippy
* docs: comment for slice
* chore: address CR comment
Don't return Option in get_timestamp()/get_sequence()
* feat: time series memtable
* feat: add some test
* fix: some clippy warnings
* chore: some rustdoc
* refactor: test
* fix: remove useless functions
* feat: add config for TimeSeriesMemtable
* chore: some optimize
* refactor: remove bucketing
* refactor: avoid cloning RegionMetadataRef across all Series; make initial_builder_capacity a const; sort batch only by timestamp and sequence
* feat: rewrite do_get for streaming get flight data
* feat: rewrite do_get call stack but leave the async stream adapter not modified yet
* feat: rewrite the async stream adapter to accept greptime record batch stream
* fix: resolve some PR comments
* feat: rewrite tests to adapt to the streaming do_get
* feat: add unit tests for streaming do_get
* feat: rewrite timer metric of merge scan
* remove unhelpful unit tests for streaming do_get
* add a new metric timer for merge scan and fix some test errors
* rewrite mysql writer to write query results in a streaming manner
* fix: fix fmt errors
* fix: rewrite sqlness runner to take into account the streaming do_get
* fix: fix toml format errors
* fix: resolve some PR comments
* fix: resolve some PR comments
* fix: refactor do_get to increase readability
* fix: refactor mysql try_write_one to increase readability
* fix: ddl client cannot update leader addr
* chore: apply suggestions from CR
* feat: add message to context
* fix: only retry if unavailable or deadline exceeded
* chore: apply suggestions from CR
* feat: datanode's row inserter
* refactor: ExprFactory
* feat: row inserter in standalone mode
* chore: minor refactor
* feat: influxdb line protocol's row protocol
* chore: minor refactor
* improve: avoid using too many strings
* no longer async
Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
* chore: do not check empty data
* chore: by review comment
* chore: by comment
* chore: by review comment
* chore: by review comment
---------
Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
* move prometheus routes to default http server
Signed-off-by: sh2 <shawnhxh@outlook.com>
* fix ci test and remove the server logic of prometheus
* remove unused import and prometheus relevant code
* fix ci: rustfmt and test
* fix ci: silly fmt
* fix ci: silly silly fmt
* change `/prom_store` back to `/prometheus`
* remove unused variable
---------
Signed-off-by: sh2 <shawnhxh@outlook.com>
* refactor: KeyValues return ValueRef
* 1. Change KeyValues returned value from pb value to ValueRef
2. Replace OpType/SemanticType with pb's OpType and SemanticType to avoid duplicated conversions.
* feat: define min value of OpType as a const
* fix: toml format
* feat: remove greptimedb-telemetry feature
* feat: adds enable_telemetry option to metasrv and datanode
* refactor: move data_home from file config to storage config
* feat: store the installation uuid into datanode and metasrv working home
* fix: cargo toml fmt
* test: ignore region failover test when using local file storage
* test: ignore telemetry reporter in test mode
* feat: print warning log when enabling telemetry
* chore: the telemetry doc link
* chore: remove enable_telemetry from datanode example config file
* refactor: rename GREPTIMEDB_TELEMETRY_CLIENT_REQUEST_TIMEOUT
* chore: rename print_warn_log to print_anonymous_usage_data_disclaimer
* ci: add context argument in build-greptime-binary action
* refactor: add 'working-dir' in upload-artifacts action and rename 'context' to 'working-dir'
* refactor: use timestamp as part of image tag when trigger manually
* fix(timestamp): add trim for the input date string
* fix(timestamp): add analyzer rule to trim strings before conversion
* fix: adjust according to CR
* chore: add version reporter
* chore: add uuid for version report
* chore: add file license
* chore: format code
* chore: fix by pr comment
* chore: change version report api url
* chore: change greptimedb opentelemetry crate name
* chore: minor code beautification
* chore: add keys only option when range etcd
* chore: fix by pr comment
* chore: fix by pr comment
* chore: change uuid file location
* chore: only run telemetry in meta leader
* chore: add more test and some minor fix
* chore: make clippy happy
* chore: fix by pr comment
* chore: fix by pr comment
* chore: add debug log for greptimedb telemetry
* feat: add exists api into KvBackend
* refactor: region lease
* feat: filter out inactive nodes in keep-lease
* feat: register&deregister inactive node
* chore: doc
* chore: ut
* chore: minor refactor
* feat: use memory_kv to store inactive node
* fix: use real error in
* chore: make inactive_node_manager's func compact
* chore: make it more efficient
* feat: clear inactive status on candidate node
* refactor: unify the make targets of building images
* refactor: make Dockerfile more clean
1. Add dev-builder image to build greptime binary easily;
2. Add 'docker/ci/Dockerfile-centos' to release centos image;
3. Delete Dockerfile of aarch64 and just need to use one Dockerfile;
Signed-off-by: zyy17 <zyylsxm@gmail.com>
---------
Signed-off-by: zyy17 <zyylsxm@gmail.com>
* feat: define structs for version
* feat: Build region from metadata and memtable builder
* feat: impl validate for metadata
* feat: add more fields to RegionMetadata
* test: more tests
* test: more check and test
* feat: allow overwriting version
* style: fix clippy
* chore: remove useless Option type in plugins (#1544)
Co-authored-by: paomian <qtang@greptime.com>
* chore: remove useless Option type in plugins (#1544)
Co-authored-by: paomian <qtang@greptime.com>
* chore: remove useless Option type in plugins (#1544)
Co-authored-by: paomian <qtang@greptime.com>
* chore: remove useless Option type in plugins (#1544)
Co-authored-by: paomian <qtang@greptime.com>
* feat: first commit for time type
* feat: impl time type
* fix: arrow vectors type conversion
* test: add time test
* test: adds more tests for time type
* chore: style
* fix: sqlness result
* Update src/common/time/src/time.rs
Co-authored-by: Lei, HUANG <6406592+v0y4g3r@users.noreply.github.com>
* chore: CR comments
---------
Co-authored-by: localhost <xpaomian@gmail.com>
Co-authored-by: paomian <qtang@greptime.com>
Co-authored-by: Lei, HUANG <6406592+v0y4g3r@users.noreply.github.com>
refactor: move heartbeat configuration into an independent section in config file
* refactor: move heartbeat configuration into an independent section in config file
* feat: add HeartbeatOptions struct
* test: modify corresponding test case
* chore: modify corresponding example file
* feat: support append entries from multiple regions at a time
* chore: add some tests
* fix: false positive mutable_key warning
* fix: append_batch api
* fix: remove unused clippy allows
* chore(prom): rename prometheus(remote storage) to prom-store and promql(HTTP server) to prometheus
* chore: apply clippy suggestions
* chore: adjust format according to rustfmt
* feat: meta procedure options
* chore: tune meta procedure options in tests
* Update src/common/procedure/Cargo.toml
Co-authored-by: dennis zhuang <killme2008@gmail.com>
---------
Co-authored-by: dennis zhuang <killme2008@gmail.com>
* feat(config-endpoint): add initial implementation
* feat: add initial handler implementation
* fix: apply clippy suggestions, use axum response instead of string
* feat: address CR suggestions
* fix: minor adjustments in formatting
* fix: add a test
* feat: add to_toml_string method to options
* fix: adjust the assertion for the integration test
* fix: adjust expected indents
* fix: adjust assertion for the integration test
* fix: improve according to clippy
* feat(http_body_limit): add initial support for DefaultBodyLimit
* fix: address CR suggestions
* fix: adjust the const for default http body limit
* fix: adjust the toml_str for the test
* fix: address CR suggestions
* fix: body_limit units in example config toml files
* fix: address clippy suggestions
* feat: initial twcs impl
* chore: rename SimplePicker to LeveledPicker
* rename some structs
* Remove Compaction strategy
* make compaction picker a trait object
* make compaction picker configurable for every region
* chore: add some test for ttl
* add some tests
* fix: some style issues in cr
* feat: enable twcs when creating tables
* feat: allow config time window when creating tables
* fix: some cr comments
* feat: add log
* feat: print more info
* feat: use chain reader
* fix: panic on getting first range
* fix: prev not updated
* fix: reverse readers and iter backward
* chore: don't print windows in log
* feat: consider memtable range
Also fix the issue of using an incorrect comparison method to sort time ranges.
* fix: merge memtable window with sst's
* feat: add use_chain_reader option
* feat: skip empty memtables
* chore: change log level
* fix: memtable range not ordered
* style: fix clippy
* chore: address review comments
* chore: print region id in log
* feat: txn for meta kvstore
* feat: txn
* chore: add unit test
* chore: more test
* chore: more test
* Update src/meta-srv/src/service/store/memory.rs
Co-authored-by: LFC <bayinamine@gmail.com>
* chore: by cr
---------
Co-authored-by: LFC <bayinamine@gmail.com>
* refactor: add table_id to get_table()/table_exists()
* refactor: Add table_id to alter table request
* refactor: Add table id to DropTableRequest
* refactor: add table id to DropTableRequest
* refactor: Use table id as key for the tables map
* refactor: use table id as file engine's map key
* refactor: Remove table reference from engine's get_table/table_exists
* style: remove unused imports
* feat!: Add table id to TableRegionalValue
* style: fix clippy
* chore: add comments and logs
* feat: support to copy from orc format
* test: add copy from orc test
* chore: add license header
* refactor: remove unimplemented macro
* chore: apply suggestions from CR
* chore: bump orc-rust to 0.2.3
* feat: add initial implementation for status endpoint
* feat(status_endpoint): add more data to response
* feat(status_endpoint): use build data env vars
* feat(status_endpoint): add simple test
* fix(status_endpoint): adjust the toml indentation
* fix: set max_files_in_l0 in unit tests to avoid compaction
* refactor: pass the whole EngineConfig
* fix: comment out unstable sqlness test
* revert commented sqlness
* add some debug log
* fix: use lazy parquet reader in MitoTable::scan_to_stream to avoid IO in plan stage
* fix: unit tests
* fix: order-by optimization
* add some tests
* fix: move metric names to metrics.rs
* fix: some cr comments
* refactor: Remove MySQL related options from Datanode
remove mysql_addr and mysql_runtime_size in datanode.rs, remove command line argument mysql_addr in cmd/src/datanode.rs
#1739
* feat: remove --mysql-addr from command line
in pre-commit, sqlness cannot find --mysql-addr because we removed it
issue #1739
* refactor: remove --mysql-addr from command line
in pre-commit, sqlness cannot find --mysql-addr because we removed it
issue #1739
* feat: Add column supports at first or after the existing columns
* Update src/common/query/Cargo.toml
---------
Co-authored-by: dennis zhuang <killme2008@gmail.com>
* feat: add SqlDialect to query context
* feat: use session in postgres handlers
* chore: refactor sql dialect
* feat: use different dialects for different sql protocols
* feat: adds GreptimeDbDialect
* refactor: replace GenericDialect with GreptimeDbDialect
* feat: save user info to session
* fix: compile error
* fix: test
* feat: using distributed lock to guard against concurrent updates of table metadata in the region failover procedure
* fix: resolve PR comments
* fix: resolve PR comments
* feat(servers): Export process metrics
* chore: update metrics related deps to get the process-metrics printed
The latest process-metrics crate depends on metrics 0.21, while we use metrics 0.20. This causes the process-metrics crate not to record metrics when the metrics macros are used.
When I use docker build to build the image, I get an error that pip is missing. Install python3-pip in the Dockerfile.
Fixes: #1643
Signed-off-by: yaoyinnan <yaoyinnan@foxmail.com>
* fix: fix close region issue
* chore: apply suggestion from CR
* chore: apply suggestion from CR
* chore: apply suggestion from CR
* chore: apply suggestion from CR
* refactor: remove close method from Region trait
* chore: remove PartialEq from CloseTableResult
* feat(storage): Add AllocTracker
* feat(storage): flush request wip
* feat(storage): support global write buffer size
* fix(storage): Test and fix size based strategy
* test(storage): Test AllocTracker
* test(storage): Test pick_by_write_buffer_full
* docs: Add flush config example
* test(storage): Test schedule_engine_flush
* feat(storage): Add metrics for write buffer size
* chore(flush): Add log when triggering flush by global buffer
* chore(storage): track allocation in update_stats
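A rough sketch of the global write-buffer accounting idea behind these AllocTracker commits (field and method names are made up for illustration, not the real storage API):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

#[derive(Default)]
struct WriteBufferManager {
    /// Bytes currently held by all mutable memtables.
    allocated: AtomicUsize,
}

impl WriteBufferManager {
    fn track(&self, bytes: usize) {
        self.allocated.fetch_add(bytes, Ordering::Relaxed);
    }

    fn release(&self, bytes: usize) {
        self.allocated.fetch_sub(bytes, Ordering::Relaxed);
    }

    /// Whether the global budget is exhausted and a flush should be picked.
    fn should_flush(&self, global_limit: usize) -> bool {
        self.allocated.load(Ordering::Relaxed) >= global_limit
    }
}
```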
* feat: add timezone info to query context
* feat: parse mysql compatible time zone string
* feat: add method to timestamp for rendering timezone aware string
* feat: use timezone from session for time string rendering
* refactor: use QueryContextRef
* feat: implement session/timezone variable read/write
* style: resolve toml format
* test: update tests
* Apply suggestions from code review
Co-authored-by: dennis zhuang <killme2008@gmail.com>
* Update src/session/src/context.rs
Co-authored-by: dennis zhuang <killme2008@gmail.com>
* refactor: address review issues
---------
Co-authored-by: dennis zhuang <killme2008@gmail.com>
* fix(table-procedure): on_register_catalog should use open_table
* test: Test recover RegisterCatalog state
* test: Fix subprocedure does not execute in test
* feat(mito): adjust procedure log level
* refactor: rename execute_parent_procedure
execute_parent_procedure -> execute_until_suspended_or_done
* feat: add delete WAL in drop_region
* chore: fix typo err.
* feat: mark all SSTs deleted and remove the region from StorageEngine's region map.
* test: add test_drop_region for StorageEngine.
* chore: make clippy happy
* fix: fix conflict
* chore: CR.
* chore: CR
* chore: fix clippy
* fix: temp file lifetime
* fix: remove region number validation
* Update src/mito/src/engine.rs
Co-authored-by: dennis zhuang <killme2008@gmail.com>
---------
Co-authored-by: dennis zhuang <killme2008@gmail.com>
* feat: support frontend heartbeat
* fix: typo "reponse" -> "response"
* add ut
* enable start heartbeat task
* chore: frontend id is specified by metasrv, not in the frontend startup parameter
* fix typo
* self-cr
* cr
* cr
* cr
* remove unnecessary headers
* use the member id in the header as the node id
* feat: Add FlushPicker
* feat(storage): Add close to StorageEngine
* style(storage): fix clippy
* feat(storage): Close regions in StorageEngine::close
* chore(storage): Clear requests on scheduler stop
* test(storage): Test flush picker
* feat(storage): Add metrics for auto flush
* feat(storage): Add flush reason and record it in metrics
* feat: Expose flush config
docs(config): Update config example
* refactor(storage): Run auto flush task in FlushScheduler
* refactor(storage): Add FlushItem trait to make FlushPicker easy to test
* feat: add storage engine region count gauge
* test: remove catalog metrics because we can't get a correct number
* feat: add metrics for log store write and compaction
* fix: address review issues
* test: simplify countdownlatch
* feat: impl Drop for LocalScheduler
* feat(storage): Impl FlushRequest and FlushHandler
* feat(storage): Use scheduler to handle flush job
* chore(storage): remove unused code
* feat(storage): Use new type pattern for RegionMap
* feat(storage): Remove on_success callback
* feat(storage): Address CR comments and add some metrics to flush
* feat(common-procedure): pub(crate) use proc_path
* feat(common-procedure): Implement delete_procedure
* feat(common-procedure): Clean procedure after it is finished
* chore(common-procedure): put path_string in front of try_stream
* test(common-procedure): Test cleaning up procedures
* feat(common-procedure): Clean procedure states in recover()
* feat(common-procedure): Use VecDeque for finished procedures
* fix: incorrect show create table output
* feat: change CreateTable's Display if table is external
* feat: change CreateTable's Display if table is external
* refactor: refactor copy from executor
* feat: support to copy from CSV and JSON format files
* feat: support to copy table to the CSV and JSON format file
* test: add tests copy from/to
* chore: apply suggestions from CR
* refactor: id first in pusher_key
* feat: is_acceptable for multi roles
* feat: mailbox
* fix: channel for mailbox
* feat: impl mailbox via heartbeat
* chore: add unit test for mailbox
* chore: by cr
* chore: typo
* chore: refactor the mailbox API
* chore: by cr
* chore: check timeout interval to 10ms
* chore: add response header
* feat: add support for information_schema.columns
* feat: remove information_schema from its view
* Update src/catalog/src/information_schema.rs
Co-authored-by: LFC <bayinamine@gmail.com>
* fix: error on table data type
* test: correct sqlness test for information schema
* test: add information_schema.columns sqlness tests
---------
Co-authored-by: LFC <bayinamine@gmail.com>
* test(storage): use assert_eq to check scan result
* feat(storage): Add more info to manifest log
* feat: Avoid error log when unable to delete
* fix: The manifest gc task should delete files <= last_version
* feat(storage): Don't log if the error kind is not found
* feat: Add keep_last_checkpoint option
* feat: improve and distinguish different errors for IllegalInsertData
* feat: change error code for UnexpectedValuesLength and ColumnAlreadyExists
* chore: improve readability of error message
* feat: add metrics for various interfaces
* feat: add db label for protocols
* feat: add postgres protocol metrics
* feat: add metrics for grpcs apis
* feat: add auth failure counter for mysql/pg
* fix: add db label to grpc prometheus interface
* Apply suggestions from code review
Co-authored-by: dennis zhuang <killme2008@gmail.com>
* feat: add error code for auth failure counter
* fix: use schema as dbname when catalog is default
---------
Co-authored-by: dennis zhuang <killme2008@gmail.com>
* fix(mito): Add metrics to mito DDL procedure
* feat(mito): Use procedure's implementation to create table
* feat(mito): Use procedure's implementation to alter table
* feat(mito): Use procedure's implementation to drop table
* style(mito): Fix clippy
* test(mito): Fix tests
* feat(mito): Add TableCreator
* feat(mito): update alter table procedure
* fix(mito): alter procedure create alter op first
* feat(mito): Combine alter table code
* fix(mito): Fix deadlock
* feat(mito): Simplify drop table procedure
* build: Download assets to cargo output dir
Also remove the output from the build script and only print the output
on failure
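A minimal sketch of the "only print the output on failure" pattern for a Cargo build script; the download command and paths here are placeholders, not the actual logic in src/servers/build.rs:

```rust
// build.rs (sketch)
use std::process::Command;

fn main() {
    let out_dir = std::env::var("OUT_DIR").expect("cargo sets OUT_DIR");

    // Placeholder for the real asset download step.
    let output = Command::new("sh")
        .args(["-c", &format!("echo 'downloading assets into {out_dir}'")])
        .output()
        .expect("failed to spawn the download command");

    // Stay quiet on success; surface captured stdout/stderr only when it fails.
    if !output.status.success() {
        println!("cargo:warning=asset download failed");
        println!("cargo:warning={}", String::from_utf8_lossy(&output.stdout));
        println!("cargo:warning={}", String::from_utf8_lossy(&output.stderr));
        panic!("asset download failed");
    }
}
```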
* chore: Update src/servers/build.rs
Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
* build: replace pushd by cd
---------
Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
* feat: implement load_options.
* refactor: build by ConfigOptions.
* refactor: init_global_logging by LoggingOptions.
* chore: make clippy happy.
* refactor: use TopLevelOptions to push top-level options to subcommands.
* test: test TopLevelOptions.
* refactor: push Options in Box.
* refactor: push Options in Box.
* refactor: use let-else and Options.
* feat: Remove create_mock_sql_handler()
create_to_request() and alter_to_request() don't need `&self`, so
we don't need to mock the sql handler to test them
* feat: Enable procedure manager by default
* docs: Update config example
* test: Enable procedure framework in all tests
* refactor(datanode): rename methods using procedure
* test(catalog): Fix temp dir drops before test finishes
* tests: Enable procedure framework in sqlness
* test: Fix sqlness standalone rename test
* fix: Drop procedure allows table not in engine
* test: Change rename table test
* fix: add options to table meta when creating table by procedure
* test: adjust error message in schema test case
* test: Fix test_sql_api error message
* fix: frontend opt should respect http addr in config file when no command-line option is given
* refactor: command line options should be Option<bool>
* fix: ci
* feat: support creating the physical plan for JSON and CSV files
* chore: apply suggestions from CR
* chore: apply suggestions from CR
* refactor(file-table-engine): use datasource Format instead
* refactor: Add table_ref() to requests as their methods
* feat: Add CreateImmutableFileTable
* feat: Add DropImmutableFileTable
* feat: Implement TableEngineProcedure for ImmutableFileTableEngine
* feat: Add common-procedure-test crate
* refactor: mito engine use common-procedure-test to test procedures
* test: Add test for create and drop table
* chore: Address review comments
* feat: Add drop table procedure
* feat: support dropping table by procedure on datanode
* test: Add test for DropTableProcedure
* test: Test drop table by procedure
* chore: update comments
* fix: Make on_remove_from_catalog idempotent
* feat: add table id and engine to information_schema.TABLES
* Update src/catalog/src/information_schema/tables.rs
Co-authored-by: Yingwen <realevenyag@gmail.com>
* chore: change table_engine to engine
* test: update sqlness for information schema
* test: update information_schema test in frontend::tests::instance_test.rs
* fix: github action sqlness information_schema test fail
* test: ignore table_id in information_schema
* test: support distribute and standalone have different output
---------
Co-authored-by: Yingwen <realevenyag@gmail.com>
* Add the cache hit/miss counter
* Verify the cache metrics are included
* Resolve comments
* Rename the error kind label name to be consistent with other metrics
* Rename the object store metric names
* Avoid using glob imports
* Format the code
* chore: Update src/object-store/src/metrics.rs mod doc
---------
Co-authored-by: Yingwen <realevenyag@gmail.com>
* chore: add some metrics for grpc client
* chore: add grpc prefix and change metrics-exporter-prometheus to add a global prefix
---------
Co-authored-by: paomian <qtang@greptime.com>
* ci: set whether it is the latest release by using 'ncipollo/release-action'
* ci: modify greptimedb install script to use the latest nightly version binary
* feat: implement predict_linear function in promql
* feat: initialize predict_linear's planner
* fix(bug): fix a bug in linear regression and add some unit test for linear regression
* chore: format code
* feat: deal with NULL value in linear_regression
* feat: add test for all value is None
* feat(promql): add holt_winters initial implementation
* feat(promql): improve docs for holt_winters
* feat(promql): adjust holt_winters implementation according to code review
* feat(promql): add holt_winters test from prometheus promql function test suite
* feat(promql): add holt_winters more tests from prometheus promql function test suite
* feat(promql): fix styling issue
Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
---------
Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
* fix: disallow creating a column whose name is a keyword unless it is quoted
* fix: tests
* Update src/sql/src/parsers/create_parser.rs
Co-authored-by: Ning Sun <classicning@gmail.com>
* fix: tests
---------
Co-authored-by: Ning Sun <classicning@gmail.com>
* feat(compaction_time_window): initial changes for compaction_time_window field support
* feat(compaction_time_window): move PickerContext creation
* feat(compaction_time_window): update region descriptor, fix formatting
* feat(compaction_time_window): add minor enhancements
* feat(compaction_time_window): fix failing test
* feat(compaction_time_window): return an error instead of silently skipping the user-provided compaction_time_window
* feat(compaction_time_window): add TODO reminder
* wip: use
* rebase develop
* chore: fix typos
* feat: replace export parquet writer with buffered writer
* fix: some cr comments
* feat: add sst_write_buffer_size config item to configure how many bytes to buffer before flushing to the underlying storage
* chore: rebase onto develop
* chore: add an HTTP metrics server to the datanode when greptime starts in distributed mode
* chore: add some docs and license
* chore: change metrics_addr to resolve address already in use error
* chore add metrics for meta service
* chore: replace metrics exporter http server from hyper to axum
* chore: format
* fix: datanode mode branching error
* fix: sqlness test 'address already in use' and start metrics in the default config
* chore: change metrics location
* chore: use builder pattern to build HttpServer
* chore: remove useless debug_assert macro in httpserver builder
* chore: resolve conflicting build error
* chore: format code
* feat: wip
* fix: Fix CreateMitoTable::table_schema not initialized from json
* feat: Implement AlterMitoTable procedure
* test: Add test for alter procedure
* feat: Register alter procedure
* fix: Recover procedures after catalog manager is started
* feat: Simplify usage of table schema in create table procedure
* test: Add rename test
* test: Add drop columns test
* test: Add compaction test
* test: Test read during compaction
* test: Add s3 object store to test
* test: only run compact test
* feat: Hold file handle in chunk stream
* test: check files still exist after compact
* feat: Revert changes to develop.yaml
* test: Simplify MockPurgeHandler
* feat: add dbname and health check for grpc api
* refactor: move health check to dedicated service
* chore: switch to merged proto rev
* feat: implement healthcheck on server-side
* fix: returns None if parquet file does not contain any rows
* fix: skip empty parquet file
* chore: add doc
* rebase develop
* fix: use flatten instead of filter_map with identity
* feat(to_unixtime): add initial implementation
* feat(to_unixtime): use Timestamp for conversion
* feat(to_unixtime): implement conversion to Result<VectorRef>
* feat(to_unixtime): make unit test pass
* feat(to_unixtime): preserve None for invalid timestamps
* feat(to_unixtime): address code review suggestions
* feat(to_unixtime): add an sqlness test
* feat(to_unixtime): adjust the assertion for the sqlness test
* Update tests/cases/standalone/common/select/dummy.sql
---------
Co-authored-by: Ruihang Xia <waynestxia@gmail.com>
* ci: install python3 and python3-dev in CI Dockerfile
* ci: release the standalone binaries with pyo3 support for multiple platforms
* refactor: install pip and pyarrow
* refactor: specify the python version
* fix: use correct env var
* fix: move COPY up so rustup know it's nightly
* fix: add `pyo3_backend` in GHA yml
* chore: name for `TODO`
* temp: not set `pyo3_backend` before find DSO
* fix: release linux with pyo3_backend
* feat: implement promql query on grpc
* test: resolve test errors
* test: add tests for promql grpc api
* refactor: align prom object name with proto
* chore: switch proto revision to main
* fix: make pyo3 optional again
* Update src/script/Cargo.toml
Co-authored-by: dennis zhuang <killme2008@gmail.com>
---------
Co-authored-by: dennis zhuang <killme2008@gmail.com>
* feat(SST): use a newType named FileId for FileMeta
* chore: rename some functions
* fix: compatible for previous FileMeta format
* fix: alias for file_id when getting deserialized
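A hedged sketch of the FileId new-type and the serde `alias` trick for staying compatible with the previous FileMeta format (field names here are illustrative, not the actual manifest schema):

```rust
use serde::{Deserialize, Serialize};

/// New-type so file identifiers cannot be confused with arbitrary strings.
#[derive(Debug, Clone, PartialEq, Eq, Hash, Serialize, Deserialize)]
struct FileId(String);

#[derive(Debug, Serialize, Deserialize)]
struct FileMeta {
    /// Older manifests serialized this field under a different name,
    /// so accept both spellings when deserializing.
    #[serde(alias = "file_name")]
    file_id: FileId,
    file_size: u64,
}
```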
* feat: mysql prepare by replacing ? in sql with params
* chore: mysql prepare statement supports time param
* chore: prepare test more types
* chore: add TODO
* feat: add take index method for VectorOp
* chore: make clippy happy
* chore: make clippy happy
* chore: improve the code
* chore: improve the code
* chore: add take null test
* chore: fix clippy
* feat: track disk usage of regions
Signed-off-by: Zheming Li <nkdudu@126.com>
* calculate disk usage when called
* add default on file meta
---------
Signed-off-by: Zheming Li <nkdudu@126.com>
* refactor: change from urlencoded to regex
* refactor: change from urlencoded to regex
* chore: add unit test
* chore: update comment
* chore: remove local benchmark test
* chore: minor fix
* chore: remove unused dep
* docs(contributingmd): add run tests commands
* docs(contributingmd): add link to nextest website
Co-authored-by: dennis zhuang <killme2008@gmail.com>
---------
Co-authored-by: dennis zhuang <killme2008@gmail.com>
* fix: apply ttl and write_buffer_size options when a table is created via procedure
* fix: address code review suggestion
* fix: use borrowing of table_options correctly
* docs: Add comments to standalone config example
* docs: Add comments to datanode config example
* docs: Add comments to frontend config example
* docs: Add comments to meta-srv config example
* docs: Use "GB" instead of "GiB"
* docs: Add link to the selector doc
* docs: Fix grammar
* docs: Change comment position
* refactor(procedure): Store error in ProcedureState
* test: Mock instance with procedure enabled
* feat: Add wait method to wait for procedure
* test(datanode): Test create table by procedure
* chore: Fix clippy
* fix: using schema instead of full database
* fix: using schema instead of full database
* fix: using schema instead of full database
* chore: add debug log
* chore: remove debug log
* chore: remove debug log
* chore: fix cr
* fix: Serialize FrontendOptions to toml
* fix: Serialize DatanodeOptions to toml
* fix: Serialize StandaloneOptions to toml
See https://users.rust-lang.org/t/why-toml-to-string-get-error-valueaftertable/85903/2
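For context on the linked thread: with the `toml` crate (0.5.x at the time, an assumption), serializing a struct whose plain field comes after a nested struct fails with `ValueAfterTable`, because TOML cannot emit a bare key once a table has started. A minimal sketch:

```rust
use serde::Serialize;

#[derive(Serialize)]
struct Logging {
    level: String,
}

#[derive(Serialize)]
struct Broken {
    logging: Logging,  // emitted as a [logging] table first...
    http_addr: String, // ...then a bare value, which TOML cannot place here
}

#[derive(Serialize)]
struct Fixed {
    http_addr: String, // plain values first
    logging: Logging,  // tables last
}

fn main() {
    let logging = Logging { level: "info".into() };
    assert!(toml::to_string(&Broken { logging, http_addr: "127.0.0.1:4000".into() }).is_err());

    let logging = Logging { level: "info".into() };
    assert!(toml::to_string(&Fixed { http_addr: "127.0.0.1:4000".into(), logging }).is_ok());
}
```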
* chore!: Rename MetaClientOpts to MetaClientOptions
BREAKING CHANGE: Change the meta_client_opts in the config file to
meta_client_options
* chore: get header in grpc & temp save
* chore: change authscheme to include data str
* chore: add auth to grpc flight handler
* chore: add unit test & hold for now since grpc api doesn't accept req input
* chore: minor change
* chore: minor change
* chore: add flight context to database interface
* chore: add test
* chore: update proto version & fix cr issue
* chore: add test
* chore: minor update
* feat: catalog list
* feat: catalog list
* feat:api
* feat: leader info
* feat: use constant
* fix: ci
* feat: query heartbeat by ip
* ut: add test
* fix: cr
* fix: cr
* fix: cr
* refactor: Use watch channel to store ProcedureState
* feat: Add a watcher to wait for state change
* test: test watcher on procedure failure
* feat: Only clear message cache on success
* feat: submit returns Watcher
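A condensed sketch of the watch-channel approach (types simplified; the real ProcedureState and Watcher live in common-procedure): the runner publishes state transitions and `submit` hands the receiver back to the caller.

```rust
use tokio::sync::watch;

#[derive(Debug, Clone, PartialEq)]
enum ProcedureState {
    Running,
    Done,
    Failed(String),
}

/// The caller-side handle returned by `submit`.
type Watcher = watch::Receiver<ProcedureState>;

/// Wait until the procedure leaves the Running state.
async fn wait(watcher: &mut Watcher) -> ProcedureState {
    while *watcher.borrow() == ProcedureState::Running {
        if watcher.changed().await.is_err() {
            break; // sender dropped; fall through with the last observed state
        }
    }
    watcher.borrow().clone()
}

/// The runner keeps the sender and updates it as the procedure progresses.
fn submit() -> (watch::Sender<ProcedureState>, Watcher) {
    watch::channel(ProcedureState::Running)
}
```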
* feat: change table options from string map to a struct, add ttl and write_buffer_size
* fix: also pass table options to table meta
* feat: pass table options when opening/creating regions
* fix: CR comments
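A rough sketch of moving from a free-form string map to a typed options struct (names are illustrative, not the exact table-engine types):

```rust
use std::collections::HashMap;
use std::time::Duration;

#[derive(Debug, Default, Clone)]
struct TableOptions {
    /// How long data is kept before being purged; `None` keeps it forever.
    ttl: Option<Duration>,
    /// Mutable memtable budget in bytes before a flush is triggered.
    write_buffer_size: Option<usize>,
    /// Any remaining engine-specific options stay as raw strings.
    extra_options: HashMap<String, String>,
}
```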
* feat: wip
* feat: Implement procedure to create mito table
* feat: Add create_table_procedure to TableEngine
* feat: Impl dump and lock for CreateMitoTable
* feat: Impl CreateMitoTable::execute and register it to manager
* feat(common-procedure): pub local mod
* feat: Add simple test for MitoCreateTable
* style: Fix clippy
* refactor: Move create_table_procedure to a new trait TableEngineProcedure
* initial impl
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
* minor (useless) refactor
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
* retrieve metric name
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
* add time index column to group by columns
filter out NaN in normalize
remove NULL in instant manipulator
accept form data as HTTP params
correct API URL
accept second literal as step param
* happy clippy
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
* update test result
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
---------
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
* refactor: make schedule request return value generic
* feat: add handler trait
* wip
* feat: use task handler
* fix: unit test
* refactor: separate scheduler mod
* chore: rename
* chore: Request use associate type
* refactor: use associate type
* refactor: use associate type to reduce generic parameters
* chore: further remove generic types
* chore: further remove a generic parameter
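A minimal sketch of the "handler with an associated request type" refactor (trait and struct names are made up), which is how the scheduler sheds its extra generic parameters:

```rust
use async_trait::async_trait;

#[async_trait]
trait Handler: Send + Sync {
    /// The request type this handler consumes.
    type Request: Send;

    async fn handle(&self, request: Self::Request);
}

/// The scheduler is generic over the handler only; the request type
/// is derived from it instead of being a second type parameter.
struct LocalScheduler<H: Handler> {
    handler: H,
}

impl<H: Handler> LocalScheduler<H> {
    async fn schedule(&self, request: H::Request) {
        self.handler.handle(request).await;
    }
}
```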
* feat: make args in coprocessor optional
* feat: supports kwargs for coprocessor as params passed by the users
* feat: supports params for /run-script
* fix: we should rewrite the coprocessor by removing kwargs
* fix: remove println
* fix: compile error after rebasing
* fix: improve http_handler_test
* test: http scripts api with user params
* refactor: tweak all to_owned
* feat: Add ContextProvider to Context
So procedures can query states of other procedures via the
ContextProvider and they don't need to hold a ProcedureManagerRef
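A hedged sketch of the idea (simplified types, not the actual common-procedure API): the context carries a provider trait object, so a procedure can query a subprocedure's state without holding the manager.

```rust
use std::sync::Arc;

use async_trait::async_trait;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct ProcedureId(u64);

#[derive(Debug, Clone)]
enum ProcedureState {
    Running,
    Done,
    Failed,
}

#[async_trait]
trait ContextProvider: Send + Sync {
    /// Look up the state of another procedure, e.g. a subprocedure.
    async fn procedure_state(&self, id: ProcedureId) -> Option<ProcedureState>;
}

/// The context handed to every procedure step.
struct Context {
    procedure_id: ProcedureId,
    provider: Arc<dyn ContextProvider>,
}
```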
* feat: Procedure supports acquiring multiple lock keys
* test: Use multi-locks in test
* feat: Add keys_to_lock/unlock
* fix: correct date/time type format for postgresql
* fix: tests for timestamp
* refactor: use Utc datetime for timestamp::to_chrono_datetime
* Update src/servers/Cargo.toml
Co-authored-by: LFC <bayinamine@gmail.com>
---------
Co-authored-by: LFC <bayinamine@gmail.com>
* feat: trigger compaction on flush
* chore: rebase develop
* feat: add config item max_file_in_level0 and remove compaction_after_flush
* fix: cr comments
* chore: add unit test to cover Timestamp::new_inclusive
* fix: workaround to fix future is not Sync
* fix: future is not sync
* fix: some cr comments
* add DistLock trait and an implementation based on etcd
wip
impl lock grpc service for meta-srv
reuse the etcd client instead of repeatedly creating etcd client
add some docs and comments
add some comment
meta client support distribute lock
fix: dead lock
self-cr
* cr
* rename "expire" -> "expire_secs"
* feat: Runner executes procedure
* feat: Add rollback key type to ParsedKey
* feat: Write rollback key when procedure is unable to execute
* feat: Use loaded step to re-submit subprocedure
* feat: Track subprocedures in ProcedureMeta
* feat: Clean message cache after the root procedure is done
* feat: Runner returns execution result
* fix: Fix tests
* test: Test Runner
* test: Test procedures_in_tree
* chore: Refine test and comments
* feat: Remove support of lock inheritance
A deadlock happens if a subprocedure acquires the same lock key as
its parent.
The main concern is if the subprocedure directly inherits its parent's
lock, then how should we behave when multiple subprocedures acquire
this same lock? Each procedure may assume it has unique access to the
same object but it actually shares the resource with others.
Now subprocedures need to use different keys to lock objects, which is
reasonable. For example:
- A parent procedure wants to create a table so it locks the table with
a key like `catalog.schema.table`
- Subprocedures create regions for the table so they lock the regions
with keys `catalog.schema.table.region-0 ~ catalog.schema.table.region-n`
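A tiny sketch of the per-object lock keys from the example above (the string formats are illustrative):

```rust
/// The parent procedure locks the table itself.
fn table_lock_key(catalog: &str, schema: &str, table: &str) -> String {
    format!("{catalog}.{schema}.{table}")
}

/// Each subprocedure locks its own region key instead of inheriting the
/// parent's table lock, so no two procedures assume exclusive access to
/// the same object by accident.
fn region_lock_key(catalog: &str, schema: &str, table: &str, region: u32) -> String {
    format!("{}.region-{region}", table_lock_key(catalog, schema, table))
}
```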
* style: Fix clippy
* feat: insert_procedure returns false on duplicate procedure
Also rename this method to try_insert_procedure
* chore: Address CR comments
* feat: implement "drop table" in distributed mode (both in SQL and gRPC)
refactor: create distributed table
some details:
- set table global value in Meta, as well as table routes value. Datanode only set table regional value
- complete instance SQL tests both in standalone and distributed mode
* fix: rebase develop
* fix: resolve PR comments
* feat: enable caching when using object store
* feat: support file cache for object store
* feat: maintaining the cached files with lru
* fix: improve the code
* empty commit
* improve the code
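A hedged sketch of the LRU bookkeeping idea (using the `lru` crate; the real cache layer wraps the object store and also manages the files on local disk):

```rust
use std::num::NonZeroUsize;

use lru::LruCache;

struct FileCache {
    /// Maps object path -> size of the locally cached file, evicting the
    /// least recently used entry once capacity is reached.
    entries: LruCache<String, u64>,
}

impl FileCache {
    fn new(capacity: usize) -> Self {
        Self {
            entries: LruCache::new(NonZeroUsize::new(capacity).expect("capacity > 0")),
        }
    }

    /// Record a newly cached file; the caller removes the evicted local file, if any.
    fn insert(&mut self, path: String, size: u64) -> Option<(String, u64)> {
        self.entries.push(path, size)
    }

    /// Mark a cached file as recently used and return its size.
    fn touch(&mut self, path: &str) -> Option<u64> {
        self.entries.get(path).copied()
    }
}
```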
# - If it's a tag push release, the version is the tag name (${{ github.ref_name }});
# - If it's a scheduled release, the version is '${{ env.NEXT_RELEASE_VERSION }}-nightly-$buildTime', like 'v0.2.0-nightly-20230313';
# - If it's a manual release, the version is '${{ env.NEXT_RELEASE_VERSION }}-$(git rev-parse --short HEAD)-YYYYMMDDSS', like 'v0.2.0-e5b243c-2023071245';
# - If it's a nightly build, the version is 'nightly-YYYYMMDD-$(git rev-parse --short HEAD)', like 'nightly-20230712-e5b243c'.

release-dev-builder-images-cn: # Note: Be careful of issue https://github.com/containers/skopeo/issues/1874 and we decide to use the latest stable skopeo container.

# 1. The tag ('v*.*.*') push release: the release workflow will be triggered by the tag push event.
# 2. The scheduled release (the version will be '${{ env.NEXT_RELEASE_VERSION }}-nightly-YYYYMMDD'): the release workflow will be triggered by the schedule event.

name: Release

on:
  push:
    tags:
      - 'v*.*.*'
  schedule:
    # At 00:00 on Monday.
    - cron: '0 0 * * 1'
  workflow_dispatch: # Allows you to run this workflow manually.
    # Notes: The GitHub Actions ONLY support 10 inputs, and it's already used up.
    inputs:
      linux_amd64_runner:
        type: choice
        description: The runner uses to build linux-amd64 artifacts
        default: ec2-c6i.4xlarge-amd64
        options:
          - ubuntu-20.04
          - ubuntu-20.04-8-cores
          - ubuntu-20.04-16-cores
          - ubuntu-20.04-32-cores
          - ubuntu-20.04-64-cores
          - ec2-c6i.xlarge-amd64    # 4C8G
          - ec2-c6i.2xlarge-amd64   # 8C16G
          - ec2-c6i.4xlarge-amd64   # 16C32G
          - ec2-c6i.8xlarge-amd64   # 32C64G
          - ec2-c6i.16xlarge-amd64  # 64C128G
      linux_arm64_runner:
        type: choice
        description: The runner uses to build linux-arm64 artifacts
        default: ec2-c6g.4xlarge-arm64
        options:
          - ec2-c6g.xlarge-arm64    # 4C8G
          - ec2-c6g.2xlarge-arm64   # 8C16G
          - ec2-c6g.4xlarge-arm64   # 16C32G
          - ec2-c6g.8xlarge-arm64   # 32C64G
          - ec2-c6g.16xlarge-arm64  # 64C128G
      macos_runner:
        type: choice
        description: The runner uses to build macOS artifacts
        default: macos-latest
        options:
          - macos-latest
      skip_test:
        description: Do not run integration tests during the build
        type: boolean
        default: true
      build_linux_amd64_artifacts:
        type: boolean
        description: Build linux-amd64 artifacts
        required: false
        default: false
      build_linux_arm64_artifacts:
        type: boolean
        description: Build linux-arm64 artifacts
        required: false
        default: false
      build_macos_artifacts:
        type: boolean
        description: Build macos artifacts
        required: false
        default: false
      build_windows_artifacts:
        type: boolean
        description: Build Windows artifacts
        required: false
        default: false
      publish_github_release:
        type: boolean
        description: Create GitHub release and upload artifacts
        required: false
        default: false
      release_images:
        type: boolean
        description: Build and push images to DockerHub and ACR
        required: false
        default: false

# Use env variables to control all the release process.
env:
  # The arguments of building greptime.
  RUST_TOOLCHAIN: nightly-2023-10-21
  CARGO_PROFILE: nightly
  # FIXME(zyy17): Would be better to use `gh release list -L 1 | cut -f 3` to get the latest release version tag, but for a long time, we will stay at 'v0.1.0-alpha-*'.
  SCHEDULED_BUILD_VERSION_PREFIX: v0.1.0-alpha

# Controls whether to run tests, include unit-test, integration-test and sqlness.

      - name: Configure scheduled build version # the version would be ${SCHEDULED_BUILD_VERSION_PREFIX}-YYYYMMDD-${SCHEDULED_PERIOD}, like v0.1.0-alpha-20221119-weekly.
Thanks a lot for considering contributing to GreptimeDB. We believe people like you would make GreptimeDB a great product. We intend to build a community where individuals can have open talks, show respect for one another, and speak with true ❤️. Meanwhile, we aim to keep transparency and make your effort count here.
Please read the guidelines, and they can help you get started. Communicate with respect to developers maintaining and developing the project. In return, they should reciprocate that respect by addressing your issue, reviewing changes, as well as helping finalize and merge your pull requests.
Follow our [README](https://github.com/GreptimeTeam/greptimedb#readme) to get the whole picture of the project. To learn about the design of GreptimeDB, please refer to the [design docs](https://github.com/GrepTimeTeam/docs).
Pull requests are great, but we accept all kinds of other help if you like, such as:
- Write tutorials or blog posts. Blog, speak about, or create tutorials about one of GreptimeDB's many features. Mention [@greptime](https://twitter.com/greptime) on Twitter and email info@greptime.com so we can give pointers and tips and help you spread the word by promoting your content on Greptime communication channels.
- Improve the documentation. [Submit documentation](http://github.com/greptimeTeam/docs/) updates, enhancements, designs, or bug fixes, and fixing any spelling or grammar errors will be very much appreciated.
- Present at meetups and conferences about your GreptimeDB projects. Your unique challenges and successes in building things with GreptimeDB can provide great speaking material. We'd love to review your talk abstract, so get in touch with us if you'd like some help!
- Submit bug reports. To report a bug or a security issue, you can [open a new GitHub issue](https://github.com/GrepTimeTeam/greptimedb/issues/new).
- Speak up about feature requests. Sending feedback is a great way for us to understand your different use cases of GreptimeDB better. If you want to share your experience with GreptimeDB, or if you want to discuss any ideas, you can start a discussion on [GitHub discussions](https://github.com/GreptimeTeam/greptimedb/discussions), chat with the Greptime team on [Slack](https://greptime.com/slack), or you can tweet [@greptime](https://twitter.com/greptime) on Twitter.
## Code of Conduct
GreptimeDB uses the [Apache 2.0 license](https://github.com/GreptimeTeam/greptim
### Before PR
- To ensure that the community is free and confident in its ability to use your contributions, please sign the Contributor License Agreement (CLA), which will be incorporated in the pull request process.
- Make sure all files have proper license header (running `docker run --rm -v $(pwd):/github/workspace ghcr.io/korandoru/hawkeye-native:v3 format` from the project root).
- Make sure all your codes are formatted and follow the [coding style](https://pingcap.github.io/style-guide/rust/).
- Make sure all unit tests pass (using `cargo test --workspace` or [nextest](https://nexte.st/index.html): `cargo nextest run`).
- Make sure all clippy warnings are fixed (you can check it locally by running `cargo clippy --workspace --all-targets -- -D warnings`).
#### `pre-commit` Hooks
You can set up the [`pre-commit`](https://pre-commit.com/#plugins) hooks to run these checks automatically on every commit.
1. Install `pre-commit`
```
$ pip install pre-commit
```
or
```
$ brew install pre-commit
```
2. Install the `pre-commit` hooks
```
$ pre-commit install
pre-commit installed at .git/hooks/pre-commit
$ pre-commit install --hook-type commit-msg
pre-commit installed at .git/hooks/commit-msg
$ pre-commit install --hook-type pre-push
pre-commit installed at .git/hooks/pre-push
```
Now, `pre-commit` will run automatically on `git commit`.
### Title
The titles of pull requests should be prefixed with category names listed in [Conventional Commits specification](https://www.conventionalcommits.org/en/v1.0.0)
like `feat`/`fix`/`docs`, with a concise summary of the code change following. Avoid using the last commit message as the pull request title.
### Description
## Community
The core team will be thrilled if you participate in any way you like. When you are stuck, try to ask for help by filing an issue with a detailed description of what you were trying to do and what went wrong. If you have any questions or if you would like to get involved in our community, please check out:
- [GreptimeDB Community Slack](https://greptime.com/slack)
Try out the features of GreptimeDB right from your browser.
### Build
#### Build from Source
To compile GreptimeDB from source, you'll need:
find installation instructions [here](https://grpc.io/docs/protoc-installation/).
**Note that `protoc` version needs to be >= 3.15** because we have used the `optional`
keyword. You can check it with `protoc --version`.
- `python3-dev` or `python3-devel` (optional, only needed if you want to run scripts
in CPython; you also need to enable the `pyo3_backend` feature when compiling, by `cargo run -F pyo3_backend` or by adding `pyo3_backend` to `features.default` in src/script/Cargo.toml, like `default = ["python", "pyo3_backend"]`): this installs the Python shared library required for running the Python
scripting engine in CPython mode. It is available as `python3-dev` on
Ubuntu (install it with `sudo apt install python3-dev`), or
`python3-devel` on RPM-based distributions (e.g. Fedora, Red Hat, SUSE). macOS's
`Python3` package should have this shared library by default. More details for compiling with PyO3 can be found in [PyO3](https://pyo3.rs/v0.18.1/building_and_distribution#configuring-the-python-version)'s documentation.
#### Build with Docker
Or if you built from docker:
docker run -p 4002:4002 -v "$(pwd):/tmp/greptimedb" greptime/greptimedb standalone start
```
Please see the online document site for more installation options and [operations info](https://docs.greptime.com/user-guide/operations/overview).
### Get started
Read the [complete getting started guide](https://docs.greptime.com/getting-started/overview) on our [official document site](https://docs.greptime.com/).
For Linux and macOS, you can easily download pre-built binaries including official releases and nightly builds that are ready to use.
In most cases, downloading the version without PyO3 is sufficient. However, if you plan to run scripts in CPython (and use Python packages like NumPy and Pandas), you will need to download the version with PyO3 and install a Python interpreter with the same version as the one the PyO3 build targets.
We recommend using virtualenv for the installation process to manage multiple Python versions.
Please refer to [contribution guidelines](CONTRIBUTING.md) for more information.
## Acknowledgement
- GreptimeDB uses [Apache Arrow](https://arrow.apache.org/) as the memory model and [Apache Parquet](https://parquet.apache.org/) as the persistent file format.
- GreptimeDB's query engine is powered by [Apache Arrow DataFusion](https://github.com/apache/arrow-datafusion).
- [Apache OpenDAL (incubating)](https://opendal.apache.org) gives GreptimeDB a very general and elegant data access abstraction layer.
- GreptimeDB's meta service is based on [etcd](https://etcd.io/).
- GreptimeDB uses [RustPython](https://github.com/RustPython/RustPython) for experimental embedded python scripting.
# tracing exporter endpoint with format `ip:port`, we use gRPC OTLP as the exporter, default endpoint is `localhost:4317`
# otlp_endpoint = "localhost:4317"
# The percentage of tracing that will be sampled and exported. Valid range `[0, 1]`: 1 means all traces are sampled, 0 means no traces are sampled; the default value is 1. Ratios > 1 are treated as 1 and fractions < 0 are treated as 0
FROM --platform=linux/amd64 saschpe/android-ndk:34-jdk17.0.8_7-ndk25.2.9519653-cmake3.22.1
ENV LANG en_US.utf8
WORKDIR /greptimedb
# Rename libunwind to libgcc
RUN cp ${NDK_ROOT}/toolchains/llvm/prebuilt/linux-x86_64/lib64/clang/14.0.7/lib/linux/aarch64/libunwind.a ${NDK_ROOT}/toolchains/llvm/prebuilt/linux-x86_64/lib64/clang/14.0.7/lib/linux/aarch64/libgcc.a
# Install dependencies.
RUN apt-get update && apt-get install -y \
libssl-dev \
protobuf-compiler \
curl \
git \
build-essential \
pkg-config \
python3 \
python3-dev \
python3-pip \
&& pip3 install --upgrade pip \
&& pip3 install pyarrow
# Trust workdir
RUN git config --global --add safe.directory /greptimedb
# Install Rust.
SHELL ["/bin/bash", "-c"]
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- --no-modify-path --default-toolchain none -y
Rollback is complicated to implement, so some procedures might not support rollback.
## Locking
The `ProcedureManager` can provide a locking mechanism that gives a procedure read/write access to a database object such as a table so other procedures are unable to modify the same table while the current one is executing.
Sub-procedures always inherit their parents' locks. The `ProcedureManager` only acquires locks for a procedure if its parent doesn't hold the lock.
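For illustration, here is a minimal sketch of that inheritance rule, using hypothetical `LockKey`/`ProcedureMeta` types (not the actual `ProcedureManager` API):
```rust
// Hypothetical types for illustration; the real ProcedureManager API may differ.
#[derive(Clone, PartialEq, Eq)]
struct LockKey(String); // e.g. "table/my_table"

struct ProcedureMeta {
    lock_key: LockKey,
    parent_locks: Vec<LockKey>, // locks already held by ancestor procedures
}

impl ProcedureMeta {
    /// Locks the manager actually needs to acquire for this procedure:
    /// sub-procedures inherit their parents' locks, so a key that a parent
    /// already holds is skipped.
    fn locks_to_acquire(&self) -> Vec<LockKey> {
        if self.parent_locks.contains(&self.lock_key) {
            Vec::new()
        } else {
            vec![self.lock_key.clone()]
        }
    }
}
```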
# Drawbacks
The `Procedure` framework introduces additional complexity and overhead to our database.
- To execute a `Procedure`, we need to write to the `ProcedureStore` multiple times, which may slow down the server
This RFC proposes a method to achieve fault tolerance for regions in GreptimeDB's distributed mode. Or, to put it another way, achieving region high availability ("HA") for a GreptimeDB cluster.
In this RFC, we mainly describe two aspects of region HA: how region availability is detected, and what recovery process needs to be taken. We also discuss some alternatives and future work.
When this feature is done, our users can expect a GreptimeDB cluster that always handles their requests to regions, although some requests may fail during the region failover. The optimization to reduce the MTTR (Mean Time To Recovery) is not a concern of this RFC, and is left for future work.
# Motivation
Fault tolerance for regions is a critical feature for our clients to use the GreptimeDB cluster confidently. High availability for users to interact with their stored data is a "must have" for any TSDB product, and that includes our GreptimeDB cluster.
# Details
## Background
Some backgrounds about region in distributed mode:
- A table is logically split into multiple regions. Each region stores a part of non-overlapping table data.
- Regions are distributed across Datanodes. The mappings are not static; they are assigned and governed by Metasrv.
- In distributed mode, client requests are scoped to regions. To be more specific, when a request that needs to scan multiple regions arrives at a Frontend, the Frontend splits the request into multiple sub-requests, each of which scans one region only, and submits them to the Datanodes that hold the corresponding regions.
In conclusion, as long as regions remain available, and can regain availability when failures do occur, overall region HA can be achieved. With this in mind, let's see how region failures are detected first.
## Failure Detection
We detect region failures in Metasrv, both passively and actively. "Passively" means that Metasrv does not fire "are you healthy" requests to regions. Instead, region health information is carried in the heartbeat requests that Datanodes submit to Metasrv.
A Datanode already carries its region stats in the heartbeat request (non-relevant fields are omitted):
```protobuf
message HeartbeatRequest {
  ...
  // Region stats on this node
  repeated RegionStat region_stats = 6;
  ...
}

message RegionStat {
  uint64 region_id = 1;
  TableName table_name = 2;
  ...
}
```
For the sake of simplicity, we don't add another field `bool available = 3` to the `RegionStat` message; instead, if a region is unavailable in the view of the Datanode that contains it, the Datanode simply does not include that region's `RegionStat` in the heartbeat request. And if the Datanode itself is unavailable, the heartbeat request is not submitted at all, which is effectively the same as not carrying the `RegionStat`.
> The heartbeat interval is now hardcoded to five seconds.
Metasrv gathers the heartbeat requests, extracts the `RegionStat`s, and treats them as region heartbeats. In this way, Metasrv maintains health information for all regions. If a region's heartbeats are not received for a period of time, Metasrv suspects the region might be unavailable. To decide whether a region has failed, Metasrv uses a failure detection algorithm called "[Phi φ Accrual Failure Detection](https://medium.com/@arpitbhayani/phi-%CF%86-accrual-failure-detection-79c21ce53a7a)". Basically, the algorithm calculates a value called "phi" to represent the possibility of a region's unavailability, based on the arrival rate of historical heartbeats. Once the "phi" rises above a pre-defined threshold, Metasrv considers the region failed.
> This algorithm has been widely adopted in some well known products, like Akka and Cassandra.
When Metasrv concludes from heartbeats that a region has failed, that is not the final decision. Here comes the "active" detection: before Metasrv decides to do a region failover, it actively invokes the health check interface of the Datanode that the failed region resides on. Only when this health check also fails does Metasrv actually start the failover for the region.
To conclude, the failure detection flow is sketched below.
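This is a hedged, illustrative sketch: the `RegionHeartbeat` struct, the threshold value, and the exponential model of heartbeat arrivals are simplifying assumptions, not the actual Metasrv implementation.
```rust
const PHI_THRESHOLD: f64 = 8.0; // illustrative threshold

struct RegionHeartbeat {
    region_id: u64,
    datanode: String,
    elapsed_secs: f64,       // time since the last heartbeat arrived
    mean_interval_secs: f64, // historical mean heartbeat interval
}

/// With an exponential model of heartbeat arrivals,
/// P(no heartbeat for `elapsed`) = exp(-elapsed / mean), so
/// phi = -log10(P) = (elapsed / mean) * log10(e).
fn phi(elapsed_secs: f64, mean_interval_secs: f64) -> f64 {
    (elapsed_secs / mean_interval_secs) * std::f64::consts::LOG10_E
}

/// Returns the regions to fail over: passively suspected by phi,
/// then confirmed by an active probe of the hosting Datanode.
fn detect(regions: &[RegionHeartbeat], health_check: impl Fn(&str) -> bool) -> Vec<u64> {
    regions
        .iter()
        // Passive step: suspect regions whose phi exceeds the threshold.
        .filter(|r| phi(r.elapsed_secs, r.mean_interval_secs) > PHI_THRESHOLD)
        // Active step: only keep regions whose Datanode also fails the probe.
        .filter(|r| !health_check(&r.datanode))
        .map(|r| r.region_id)
        .collect()
}
```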
- Why detect actively when we already have passive detection? Because the network can sometimes be connectable in only one direction (especially in complex cloud environments): the Datanode's heartbeats cannot reach Metasrv, while Metasrv can still reach the Datanode. Active detection avoids this false positive.
- Why does the detection work on regions instead of Datanodes? Because it is possible that only some of the regions on a Datanode are unavailable, not ALL of them, especially when Datanodes are shared by multiple tenants. In that case, it's better to do failover on the affected regions only, instead of all regions residing on the Datanode. All in all, we want more subtle control over region failover.
So we have detected that some regions are unavailable. How do we regain their availability?
## Region Failover
Region failover largely relies on the remote WAL, aka "[Bunshin](https://github.com/GreptimeTeam/bunshin)". I'm not including any of its details in this RFC; let's just assume we already have it.
In general, region failover is fairly simple. Once Metasrv decides to do failover upon some regions, it first chooses one or more Datanodes to hold the failed region. This can be done easily, as the Metasrv already has the whole picture of Datanodes: it knows which Datanode has the minimum regions, what Datanode historically had the lowest CPU usage and IO rate, and how the Datanodes are assigned to tenants, among other information that can all help the Metasrv choose the most suitable Datanodes. Let's call these chosen Datanodes as "candidates".
> The strategy to choose the most suitable candidates requires careful design, but that's another RFC.
Then, Metasrv sets the states of these failed regions as "passive". We should add a field to `Region`:
```protobuf
message Region {
  uint64 id = 1;
  string name = 2;
  Partition partition = 3;

  enum State {
    Active,
    Passive,
  }
  State state = 4;

  map<string, string> attrs = 100;
}
```
Here `Region` is used in the message `RegionRoute`, which indicates how write requests are split among regions. When a region is set to "passive", the Frontend knows writes to it should be rejected at the moment (reads from the region are not blocked, however).
> Making a region "passive" here effectively blocks writes to it. That's OK in the failover situation, since the region has failed anyway. However, when dealing with active maintenance operations, the region state requires a more refined design. But that's another story.
Third, Metasrv fires "close region" requests to the failed Datanodes, and fires "open region" requests to the candidates. "Close region" requests might fail due to the unavailability of Datanodes, but that's fine; it's just a best-effort attempt to reduce the chance of any in-flight writes being handled unintentionally after the region is set to "passive". The "open region" requests must succeed, though. Datanodes open regions from the remote WAL.
> Currently "close region" is undefined in Datanode. It could be a local cache cleanup of region data or other resource tidy-up.
Finally, when a candidate successfully opens its region, it calls back to Metasrv, indicating it is ready to handle the region. The "call back" here is backed by its heartbeat to Metasrv. Metasrv updates the region's state to "active", so as to let the Frontend lift the restriction on region writes (again, the read path of the region is untouched).
All the above steps should be managed by the remote procedure framework. That is another implementation challenge in the region failover feature (the remote WAL being one, of course).
Remote WAL raises a problem that could harm the write throughput of the GreptimeDB cluster: each write request has to make at least two remote calls, one from Frontend to Datanode, and one from Datanode to the remote WAL. What if we did it the "[Neon](https://github.com/neondatabase/neon)" way, making the remote WAL sit between the Frontend and the Datanode; couldn't that improve our write throughput? It could, though there are some consistency issues like "read-your-writes" to solve.
However, the main reasons we don't adopt this method are two-fold:
1. The remote WAL is planned to be quorum based, so it can be written efficiently;
2. More importantly, we are planning to make the remote WAL an option that users can choose not to enable (at the cost of some reliability reduction).
## No WAL, Replication instead
This method replicates regions across Datanodes directly, the common way in shared-nothing databases. Were the main region to fail, a standby region in the replica group would be elected as the new "main" and take over the read/write requests. The main concern with this method is its incompatibility with our current architecture and code structure. It requires a major redesign, but gains no significant advantage over the remote WAL method.
However, replication does have its own advantages that we can learn from to optimize the failover procedure.
# Future Work
Some optimizations we could take:
- To reduce the MTTR, we could make Metasrv choose the candidate for each region during normal operation. The candidate does some preparation work to reduce the open-region time, effectively accelerating the failover procedure.
- We could adopt the replication method to the degree that region replicas are used as fast catch-up candidates. The data difference among replicas is minor, so region failover does not need to load or exchange too much data, greatly reducing the failover time.
User data may already exist in other storages, e.g., file systems, S3, etc., in CSV, Parquet, JSON format, etc. We can provide users the ability to perform SQL queries on these files.
# Details
## Overview
The file external table provides users the ability to perform SQL queries on these files.
For example, a user has a CSV file on the local file system `/var/data/city.csv`:
```
Rank , Name , State , 2023 Population , 2020 Census , Annual Change , Density (mi²)
1 , New York City , New York , 8,992,908 , 8,804,190 , 0.7% , 29,938
2 , Los Angeles , California , 3,930,586 , 3,898,747 , 0.27% , 8,382
<col_name> <col_type> [NULL | NOT NULL] [COMMENT "<comment>"]
)
]
[ WITH
(
LOCATION = 'url'
[,FIELD_DELIMITER = 'delimiter' ]
[,RECORD_DELIMITER = 'delimiter' ]
[,SKIP_HEADER = '<number>' ]
[,FORMAT = { csv | json | parquet } ]
[,PATTERN = '<regex_pattern>' ]
[,ENDPOINT = '<uri>' ]
[,ACCESS_KEY_ID = '<key_id>' ]
[,SECRET_ACCESS_KEY = '<access_key>' ]
[,SESSION_TOKEN = '<token>' ]
[,REGION = '<region>' ]
[,ENABLE_VIRTUAL_HOST_STYLE = '<boolean>']
..
)
]
```
### Supported File Format
The external file table supports multiple formats; we divide them into row formats and columnar formats.
Row formats:
- CSV, JSON
Columnar formats:
- Parquet
Some of these formats support filter pushdown and others don't. If users query very large files in a format that doesn't support pushdown, scanning the full files might consume a lot of IO and cause a long-running query.
### File Table Engine

We implement a file table engine that creates an external table by accepting user-specified file paths and treating all records as immutable.
1. File Format Decoder: decode files to the `RecordBatch` stream.
2. File Table Engine: implement the `TableProvider` trait, store necessary metadata in memory, and provide scan ability.
Our implementation is better suited for small files. For large files (e.g., a GB-level CSV file), we suggest users import the data into the database.
## Drawbacks
- Some formats don't support filter pushdown
- Hard to support indexing
## Life cycle
### Register a table
1. Write metadata to manifest.
2. Create the table via file table engine.
3. Register table to `CatalogProvider` and register table to `SystemCatalog`(persist tables to disk).
### Deregister a table (Drop a table)
1. Fetch the target table info (figure out table engine type).
2. Deregister the target table in `CatalogProvider` and `SystemCatalog`.
3. Find the target table engine.
4. Drop the target table.
### Recover a table when restarting
1. Collect table names and engine type info.
2. Find the target tables in different engines.
3. Open and register tables.
# Alternatives
## Using DataFusion API
We can use the DataFusion API to register a file table and query it:
```rust
let df = ctx.sql("SELECT a, MIN(b) FROM example WHERE a <= b GROUP BY a LIMIT 100").await?;
```
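For instance, a hedged end-to-end sketch with DataFusion's `SessionContext` (the file path and table name here are illustrative):
```rust
use datafusion::error::Result;
use datafusion::prelude::{CsvReadOptions, SessionContext};

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Register a CSV file as a queryable table named "example".
    ctx.register_csv("example", "/var/data/city.csv", CsvReadOptions::new())
        .await?;
    // Run SQL directly over the registered file.
    let df = ctx.sql("SELECT * FROM example LIMIT 10").await?;
    df.show().await?;
    Ok(())
}
```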
### Drawbacks
DataFusion implements its own `Object Store` abstraction and supports parsing partitioned directories, which can push down filters and skip some directories. However, this makes it impossible to use our `LruCacheLayer` (the parsing of the partitioned directories requires paths as input). If we want to fully manage memory ourselves, we should implement our own `TableProvider` or `Table`.
- Impossible to use `CacheLayer`
## Introduce an intermediate representation layer

We convert all files into `parquet` as an intermediate representation. Then we only need to implement a `parquet` file table engine, and we already have a similar one. It also supports limited filter pushdown via the `parquet` row group stats.
Enhance the logical planner to be aware of the distributed, multi-region table topology, to achieve "push computation down" execution rather than the current "pull data up" manner.
# Motivation
Querying distributively can leverage the advantage of GreptimeDB's architecture to process large datasets that exceed the capacity of a single node, or to accelerate query execution by running it in parallel. This task includes two sub-tasks:
- Be able to transform the plan so that as much computation as possible is pushed down to the data source.
- Be able to handle pipeline breaker (like `Join` or `Sort`) on multiple computation nodes.
This is a relatively complex topic. To keep this RFC focused, I'll concentrate on the first one.
# Details
## Background: Partition and Region
GreptimeDB supports table partitioning, where the partition rule is set during table creation. Each partition can be further divided into one or more physical storage units known as "regions". Both partitions and regions are divided based on rows:
``` text
┌────────────────────────────────────┐
│ │
│ Table │
│ │
└─────┬────────────┬────────────┬────┘
│ │ │
│ │ │
┌─────▼────┐ ┌─────▼────┐ ┌─────▼────┐
│ Region 1 │ │ Region 2 │ │ Region 3 │
└──────────┘ └──────────┘ └──────────┘
Row 1~10 Row 11~20 Row 21~30
```
Generally speaking, a region is the minimum element of data distribution, and we can also use it as the unit for distributing computation. This greatly simplifies the routing logic of the distributed planner, by always scheduling the computation to the node that currently opens the corresponding region. It is also easy to scale out more nodes for computing, since GreptimeDB's data is persisted on a shared storage backend like S3. But this is a bit beyond the scope of this specific topic.
## Background: Commutativity
Commutativity is an attribute that describes whether two operations can exchange their application order: $P1(P2(R)) \Leftrightarrow P2(P1(R))$. If the equation holds, we can transform one expression into another form without changing its result. This is useful for rewriting SQL expressions, and is the theoretical basis of this RFC.
Take this SQL as an example
``` sql
SELECT a FROM t WHERE a > 10;
```
As we know, projection and filter are commutative ($Projection(Filter(R)) \Leftrightarrow Filter(Projection(R))$ when the filter only references projected columns), so this query can be translated into the following two identical plan trees:
```text
┌─────────────┐ ┌─────────────┐
│Projection(a)│ │Filter(a>10) │
└──────▲──────┘ └──────▲──────┘
│ │
┌──────┴──────┐ ┌──────┴──────┐
│Filter(a>10) │ │Projection(a)│
└──────▲──────┘ └──────▲──────┘
│ │
┌──────┴──────┐ ┌──────┴──────┐
│ TableScan │ │ TableScan │
└─────────────┘ └─────────────┘
```
## Merge Operation
This RFC proposes to add a new expression node `MergeScan` to merge results from several regions in the frontend. It wraps the abstraction of remote data and execution, and exposes a `TableScan` interface to the upper level.
``` text
▲
│
┌───────┼───────┐
│ │ │
│ ┌──┴──┐ │
│ └──▲──┘ │
│ │ │
│ ┌──┴──┐ │
│ └──▲──┘ │ ┌─────────────────────────────┐
│ │ │ │ │
│ ┌────┴────┐ │ │ ┌──────────┐ ┌───┐ ┌───┐ │
│ │MergeScan◄──┼────┤ │ Region 1 │ │ │ .. │ │ │
│ └─────────┘ │ │ └──────────┘ └───┘ └───┘ │
│ │ │ │
└─Frontend──────┘ └─Remote-Sources──────────────┘
```
This merge operation simply chains all the underlying remote data sources and returns `RecordBatch`es, just like a coalesce op. Each remote source is a gRPC query to a datanode via the substrait logical plan interface. The plan is transformed and divided from the original query that arrives at the frontend.
## Commutativity of MergeScan
Obviously, the position of `MergeScan` is the key to the distributed plan. The closer it is to the underlying `TableScan`, the less computation is taken by datanodes. Thus the goal is to pull the `MergeScan` up as far as possible. "Pull up" means exchanging `MergeScan` with its parent node in the plan tree, which means we should check the commutativity between the existing expression nodes and the `MergeScan`. Here I classify all the possibilities into five categories:
After establishing the set of commutative relations for all expressions, we can begin transforming the logical plan. There are four steps:
- Add a merge node before table scan
- Evaluate commutativity in a bottom-up way, stop at the first non-commutative node
- Divide the TableScan to scan over partitions
- Execute
First, insert the `MergeScan` on top of the bottom `TableScan` node. Then examine commutativity starting from the `MergeScan` node and transform the plan tree based on the result. Stop this process at the first non-commutative node.
``` text
┌─────────────┐ ┌─────────────┐
│ Sort │ │ Sort │
└──────▲──────┘ └──────▲──────┘
│ │
┌─────────────┐ ┌──────┴──────┐ ┌──────┴──────┐
│ Sort │ │Projection(a)│ │ MergeScan │
└──────▲──────┘ └──────▲──────┘ └──────▲──────┘
│ │ │
┌──────┴──────┐ ┌──────┴──────┐ ┌──────┴──────┐
│Projection(a)│ │ MergeScan │ │Projection(a)│
└──────▲──────┘ └──────▲──────┘ └──────▲──────┘
│ │ │
┌──────┴──────┐ ┌──────┴──────┐ ┌──────┴──────┐
│ TableScan │ │ TableScan │ │ TableScan │
└─────────────┘ └─────────────┘ └─────────────┘
(a) (b) (c)
```
Then, in the physical planning phase, convert the sub-tree below `MergeScan` into a remote query request and dispatch it to all the regions, and let the `MergeScan` receive the results and feed them to its parent node.
To simplify the overall complexity, any error in the procedure will fail the entire query and cancel all the other parts.
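To make the pull-up rule concrete, here is a hedged sketch of the bottom-up check. The node names and the two-category `Commutativity` enum are illustrative simplifications; the real planner works on DataFusion's `LogicalPlan` and distinguishes more categories.
```rust
// Illustrative plan representation: only the node name matters here.
struct PlanNode {
    name: &'static str,
}

enum Commutativity {
    Commutative,    // MergeScan can be pulled above this node
    NonCommutative, // stop here; this node and everything above runs in frontend
}

fn commutativity(node: &PlanNode) -> Commutativity {
    match node.name {
        "Projection" | "Filter" | "Limit" => Commutativity::Commutative,
        _ => Commutativity::NonCommutative, // e.g. Sort, Join
    }
}

/// `nodes` are the operators above the TableScan, ordered bottom-up.
/// Returns how many of them end up below MergeScan (i.e. run on datanodes);
/// the MergeScan is inserted right above that commutative prefix.
fn merge_scan_position(nodes: &[PlanNode]) -> usize {
    nodes
        .iter()
        .take_while(|n| matches!(commutativity(n), Commutativity::Commutative))
        .count()
}

// Example: [Projection, Sort] -> position 1: MergeScan sits above Projection, below Sort.
```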
# Alternatives
## Spill
If we only consider the ability to process large datasets, we can enable DataFusion's spill ability to temporarily persist intermediate data to disk, like "swap" memory. But this would lead to very slow performance and very large write amplification.
# Future Work
As described in the `Motivation` section, we can further explore the distributed planner at the physical execution level, by introducing mechanisms like Spark's shuffle to improve parallelism and reduce intermediate pipeline breakers' stages.
Refactor table engines to address several historical tech debts.
# Motivation
Both `Frontend` and `Datanode` have to deal with multiple regions in a table. This results in code duplication and an additional burden on the `Datanode`.
Before:
```mermaid
graph TB
subgraph Frontend["Frontend"]
subgraph MyTable
A("region 0, 2 -> Datanode0")
B("region 1, 3 -> Datanode1")
end
end
MyTable --> MetaSrv
MetaSrv --> ETCD
MyTable-->TableEngine0
MyTable-->TableEngine1
subgraph Datanode0
Procedure0("procedure")
TableEngine0("table engine")
region0
region2
mytable0("my_table")
Procedure0-->mytable0
TableEngine0-->mytable0
mytable0-->region0
mytable0-->region2
end
subgraph Datanode1
Procedure1("procedure")
TableEngine1("table engine")
region1
region3
mytable1("my_table")
Procedure1-->mytable1
TableEngine1-->mytable1
mytable1-->region1
mytable1-->region3
end
subgraph manifest["table manifest"]
M0("my_table")
M1("regions: [0, 1, 2, 3]")
end
mytable1-->manifest
mytable0-->manifest
RegionManifest0("region manifest 0")
RegionManifest1("region manifest 1")
RegionManifest2("region manifest 2")
RegionManifest3("region manifest 3")
region0-->RegionManifest0
region1-->RegionManifest1
region2-->RegionManifest2
region3-->RegionManifest3
```
`Datanodes` can update the same manifest file for a table as regions are assigned to different nodes in the cluster. We also have to run procedures on `Datanode` to ensure the table manifest is consistent with region manifests. "Table" in a `Datanode` is a subset of the table's regions. The `Datanode` is much closer to `RegionServer` in `HBase` which only deals with regions.
In cluster mode, we store table metadata both in etcd and in the table manifest, so the table manifest becomes redundant. We can remove the table manifest if we refactor the table engines into region engines that only care about regions. What's more, we wouldn't need to run those procedures on `Datanode`.
After:
```mermaid
graph TB
subgraph Frontend["Frontend"]
direction LR
subgraph MyTable
A("region 0, 2 -> Datanode0")
B("region 1, 3 -> Datanode1")
end
end
MyTable --> MetaSrv
MetaSrv --> ETCD
MyTable-->RegionEngine
MyTable-->RegionEngine1
subgraph Datanode0
RegionEngine("region engine")
region0
region2
RegionEngine-->region0
RegionEngine-->region2
end
subgraph Datanode1
RegionEngine1("region engine")
region1
region3
RegionEngine1-->region1
RegionEngine1-->region3
end
RegionManifest0("region manifest 0")
RegionManifest1("region manifest 1")
RegionManifest2("region manifest 2")
RegionManifest3("region manifest 3")
region0-->RegionManifest0
region1-->RegionManifest1
region2-->RegionManifest2
region3-->RegionManifest3
```
This RFC proposes to refactor table engines into region engines as a first step to make the `Datanode` act like a `RegionServer`.
# Details
## Overview
We plan to refactor the `TableEngine` trait into `RegionEngine` gradually. This RFC focuses on the `mito` engine as it is the default table engine and the most complicated engine.
Currently, `MitoEngine` is built upon `StorageEngine`, which manages the regions of the `mito` engine. Since `MitoEngine` becomes a region engine, we can combine `StorageEngine` with `MitoEngine` to simplify our code structure.
The chart below shows the overall architecture of the `MitoEngine`.
This RFC proposes a new metric engine that can significantly enhance our ability to handle the tremendous number of small tables in scenarios like Prometheus metrics, by leveraging a synthetic wide table that offers storage and metadata multiplexing capabilities over the existing engine.
# Motivation
The concept "Table" in GreptimeDB is a bit "heavy" compared to other time-series storage like Prometheus or VictoriaMetrics. This has lots of disadvantages in aspects from performance, footprint, and storage to cost.
# Details
## Top level description
- User Interface
This feature will add a new type of storage engine. It might be exposed as an option like `with ENGINE=mito`, or as an internal interface such as automatic table creation on Prometheus remote write. From the user's side, there is no difference from tables in the mito engine. All DDL like `CREATE` and `ALTER`, and DML like `SELECT`, should be supported.
- Implementation Overview
This new engine doesn't re-implement low-level components like file R/W etc. It's a wrapper layer over the existing mito engine, with extra storage and metadata multiplexing capabilities. I.e., it exposes multiple tables based on one mito engine table, like this:
The following parts will describe these implementation details:
- How to route these metric region tables and how those tables are distributed
- How to maintain the schema and other metadata of the underlying mito engine table
- How to maintain the schema of metric engine table
- How the query goes
## Routing
Before this change, the region route rule was based on a group of partition keys. Relation of physical table to region is one-to-many.
``` rust
pub struct PartitionDef {
partition_columns: Vec<String>,
partition_bounds: Vec<PartitionBound>,
}
```
And for metric engine tables, the key difference is we split the concept of "physical table" and "logical table". Like the previous ASCII chart, multiple logical tables are based on one physical table. The relationship of logical table to region becomes many-to-many. Thus, we must include the table name (of logical table) into partition rules.
Considering the partition/route interface is a generic map from a string array to a region id, all we need to do is insert the logical table name into the request:
``` rust
fn route(request: Vec<String>) -> RegionId;
```
The next question is where to do this conversion. The basic idea is to dispatch different routing behaviors based on the engine type. Since we have all the necessary information in the frontend, it's a good place to do that, and it leaves the meta server untouched. The essential change is to associate the engine type with the route rule.
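For illustration, a hedged sketch of how a frontend-side helper might assemble such a request; the function and engine names are assumptions, not the actual code:
```rust
// Illustrative only: the real request/route types live in the meta client.
fn build_route_request(
    engine: &str,
    logical_table_name: &str,
    partition_key_values: Vec<String>,
) -> Vec<String> {
    let mut request = partition_key_values;
    if engine == "metric" {
        // Logical tables share physical regions, so the logical table name
        // must take part in routing as an extra "partition key".
        request.insert(0, logical_table_name.to_string());
    }
    request
}
```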
## Physical Region Schema
The idea "physical wide table" is to perform column-level multiplexing. I.e., map all logical columns to physical columns by their names.
This approach is very straightforward but has one problem: it doesn't work when two columns have the same name but different semantic types (time index, tag, or field) or data types. E.g., `CREATE TABLE t1 (c1 timestamp(3) TIME INDEX)` and `CREATE TABLE t2 (c1 STRING PRIMARY KEY)`.
One possible workaround is to prefix each column with its data type and semantic type, like `_STRING_PK_c1`. However, considering the primary goal at present is to support data from monitoring metrics like Prometheus remote write, it's acceptable not to support this at first, because data types there are often simple and limited.
The next point is changing the physical table's schema. This is only needed when creating a new logical table or altering an existing one. Typically, table creation and alteration are explicit. We only need to emit an add-column request to the underlying physical table when processing the logical table's DDL. GreptimeDB can create or alter tables automatically for some protocols, but the internal logic is the same.
Also for simplicity, we don't support shrinking the underlying table at first. This can be achieved later by introducing a mechanism on the physical columns.
The Frontend doesn't need to keep the physical table's schema.
## Metadata of physical regions
Those metric engine regions need to store extra metadata, like the schema of each logical table or the names of all logical tables. That information is relatively simple and can be stored in a key-value format. For now, we have to use another physical mito region for metadata. This involves an issue with region scheduling: since we don't have the ability to perform affinity scheduling, the initial version will just assume the data region and the metadata region are on the same instance. See "Other storage for physical region's metadata" in the Alternatives section for possible future improvements.
Here is the schema of the metadata region and how we would use it. The `CREATE TABLE` clause of the metadata region looks like the following; notice that it isn't actually created via SQL.
``` sql
CREATE TABLE metadata(
ts timestamp time index,
key string primary key,
value string
);
```
The `ts` field is just a placeholder for the constraint that a mito region must contain a time index field; it will always be `0`. The other two fields, `key` and `value`, will be used as a k-v storage. It contains two groups of keys:
- `__table_<TABLE_NAME>` is used for marking table existence. It doesn't have a value.
- `__column_<TABLE_NAME>_<COLUMN_NAME>` is used for marking column existence; the value is the column's semantic type.
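A small sketch of how these keys could be composed; the helper names are illustrative, only the key formats come from the list above:
```rust
/// Key marking that a logical table exists (no value stored).
fn table_key(table_name: &str) -> String {
    format!("__table_{table_name}")
}

/// Key marking that a column of a logical table exists;
/// the value stored under it is the column's semantic type.
fn column_key(table_name: &str, column_name: &str) -> String {
    format!("__column_{table_name}_{column_name}")
}

fn main() {
    assert_eq!(table_key("cpu_usage"), "__table_cpu_usage");
    assert_eq!(column_key("cpu_usage", "host"), "__column_cpu_usage_host");
}
```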
## Physical region implementation
This RFC proposes to add a new region implementation named "MetricRegion". As shown in the first chart, it's wrapped over the existing mito region. This section will describe the implementation details. First, here is a chart showing what the region hierarchy looks like:
```plaintext
┌───────────────────────┐
│ Metric Region │
│ │
│ ┌────────┬──────────┤
│ │ Mito │ Mito │
│ │ Region │ Region │
│ │ for │ for │
│ │ Data │ Metadata │
└───┴────────┴──────────┘
```
All upper levels only see the Metric Region. E.g., the Meta Server schedules this region, and the Frontend routes requests to this Metric Region's id. To be scheduled (opened or closed, etc.), the Metric Region needs to implement its own procedures. Most of those procedures can simply be assembled from the underlying Mito Regions', but those related to data, like alter or drop, will have their own new logic.
Another point is the region id. Since the region id is used widely, from the meta server to persisted state, it's better to keep it unchanged. This means we can't use the same id for two regions; each needs its own. To achieve this, this RFC proposes a concept named "region id group". A region id group is a group of region ids that are bound for different purposes, like the two underlying regions here.
It reserves the first 8 bits of the `u32` region number for grouping. Each group has one main id (the first one) and other sub ids (the remaining non-zero ids). Components other than the region implementation itself are not aware of the existence of the region id group; they only see the main id. The region implementation is responsible for managing and using the region id group.
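A hedged sketch of that layout, assuming the region id keeps the table id in its high 32 bits and the region number in its low 32 bits, with the group stored in the region number's top 8 bits (the helper names are illustrative):
```rust
/// Illustrative layout: region id = (table_id << 32) | region_number,
/// where the top 8 bits of the region number hold the group.
fn region_number(region_id: u64) -> u32 {
    region_id as u32
}

/// The "main" id is the one with group 0 in its top 8 bits.
fn is_main_region(region_id: u64) -> bool {
    region_number(region_id) >> 24 == 0
}

/// Derive a sub region id (e.g. the metadata region) from the main id
/// by writing a non-zero group into the reserved top 8 bits.
fn sub_region_id(main_region_id: u64, group: u8) -> u64 {
    let table_part = main_region_id & !(u32::MAX as u64);
    let number = region_number(main_region_id) | ((group as u32) << 24);
    table_part | number as u64
}
```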
From previous sections, we can conclude the following points about routing:
- Each "logical table" has its own, universe unique table id.
- Logical table doesn't have physical region, they share the same physical region with other logical tables.
- Route rule of logical table's is a strict subset of physical table's.
To associate the logical table with physical region, we need to specify necessary information in the create table request. Specifically, the table type and its parent table. This require to change our gRPC proto's definition. And once meta recognize the table to create is a logical table, it will use the parent table's region to create route entry.
And to reduce the cost of region failover (which needs to update the physical table's route info), we'd better split the current route table structure into two parts:
```rust
region_route: Map<TableName, [RegionId]>,
node_route: Map<RegionId, NodeId>,
```
By doing this, on each failover the meta server only needs to update the second `node_route` map, leaving the first one untouched.
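A minimal sketch of the split and of a failover update that touches only the second map (the type aliases are illustrative):
```rust
use std::collections::HashMap;

type TableName = String;
type RegionId = u64;
type NodeId = u64;

struct RouteTable {
    /// Rarely changes: which regions a table consists of.
    region_route: HashMap<TableName, Vec<RegionId>>,
    /// Changes on failover: which node currently serves a region.
    node_route: HashMap<RegionId, NodeId>,
}

impl RouteTable {
    /// On failover only the region -> node mapping is rewritten;
    /// `region_route` stays untouched.
    fn on_failover(&mut self, region: RegionId, new_node: NodeId) {
        self.node_route.insert(region, new_node);
    }
}
```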
## Query
Like other existing components, a user query always starts in the frontend. In the planning phase, frontend needs to fetch related schemas of the queried table. This part is the same. I.e., changes in this RFC don't affect components above the `Table` abstraction.
# Alternatives
## Other routing method
We could also apply this "special" route rule in the meta server, but there is no essential difference from the proposed method.
## Other storage for physical region's metadata
Once we have implemented the "region family" that allows multiple physical schemas to exist in one region, we can store the metadata and table data in one region.
Before that, we could also let the `MetricRegion` hold a `KvBackend` to access the storage layer directly, but this breaks the abstraction in some way.
# Drawbacks
Since the physical storage is mixed together, it's hard to do fine-grained operations at the table level, like configuring TTL, memtable size, or compaction strategy per table, or defining different partition rules for different tables. For scenarios like this, it's better to move the table out of the metric engine and "upgrade" it to a normal mito engine table. This requires a low-cost migration process, and we have to ensure data consistency during the migration, which may require an out-of-service period.
Refactor `Table` trait to adapt the new region server architecture and make code more straightforward.
# Motivation
The `Table` trait was designed with the assumption that both frontend and datanode keep the same concepts, and that all operations are served by a `Table`. However, in practice we found that not all operations are suitable to be served by a `Table`. For example, the `Table` doesn't hold actual physical data itself, thus operations like write or alter are simply a proxy over the underlying regions. And in the recent refactoring of the datanode ([rfc table-engine-refactor](./2023-07-06-table-engine-refactor.md)), we are changing the datanode into a region server that is only aware of `Region` things. This also calls for a refactoring of the `Table` trait.
# Details
## Definitions
The current `Table` trait contains the following methods:
```rust
pub trait Table {
    /// Get a reference to the schema for this table
    fn schema(&self) -> SchemaRef;
    /// Get a reference to the table info.
    fn table_info(&self) -> TableInfoRef;
    /// Get the type of this table for metadata/catalog purposes.
    fn table_type(&self) -> TableType;
    // ...
}
```
Considering that most access to metadata happens in the frontend (e.g., for routing or querying), that all persisted data is stored in regions, and that only the query engine needs to read data, we can divide the `Table` trait into three concepts:
- struct `Table` provides metadata:
```rust
impl Table {
/// Get a reference to the schema for this table
fn schema(&self) -> SchemaRef;
/// Get a reference to the table info.
fn table_info(&self) -> TableInfoRef;
/// Get the type of this table for metadata/catalog purposes.
fn table_type(&self) -> TableType;
/// Get statistics for this table, if available
fn statistics(&self) -> Option<TableStatistics>;
fn to_data_source(&self) -> DataSourceRef;
}
```
- Requests to region server
- `InsertRequest`
- `AlterRequest`
- `DeleteRequest`
- `FlushRequest`
- `CompactRequest`
- `CloseRequest`
- trait `DataSource` provides data (`RecordBatch`)
`Table` will only be used in frontend. It's constructed from the `OpenTableRequest` or `CreateTableRequest`.
`Table` also provides a method `to_data_source` to generate a `DataSource` from itself. But this method is only for non-`TableType::Base` tables (i.e., `TableType::View` and `TableType::Temporary`), because a `TableType::Base` table doesn't hold actual data itself; its `DataSource` should be constructed from the `Region` directly (in other words, it's a remote query).
Constructing a `DataSource` requires some extra information, named `TableSourceProvider`:
```rust
type TableFactory = Arc<dyn Fn() -> DataSourceRef>;
pub enum TableSourceProvider {
Base,
View(LogicalPlan),
Temporary(TableFactory),
}
```
## Use `DataSource`
`DataSource` will be adapted to the `TableProvider` from DataFusion that can be `scan()`ed in a `TableScan` plan.
In frontend this is done in the planning phase. And datanode will have one implementation for `Region` to generate record batch stream.
## Interact with RegionServer
Previously, persisted state change operations went through the old `Table` trait, as said before. Now they will come from the action source, like a procedure or protocol handler, directly to the region server. E.g., on alter table, the corresponding procedure will generate its `AlterRequest` and send it to the regions; a write request will be split in the frontend handler and sent to the regions. `Table` only provides necessary metadata, like route information if needed, but it is no longer a required part of the path.
## Implement temporary table
A temporary table is a special table that doesn't resolve to any persistent physical region. Examples are:
- the `Numbers` table for testing, which produces a record batch that contains 0-100 integers.
- tables in information schema. It is an interface for querying catalog's metadata. The contents are generated on the fly with information from `CatalogManager`. The `CatalogManager` can be held in `TableFactory`.
- A function table that produces data generated by a formula or function, like something that always returns `sin(current_timestamp())`.
## Relationship among those components
Here is a diagram to show the relationship among those components, and how they interact with each other.
Currently, multiple transactions are involved during the procedure. This implementation is inefficient, and it's hard to keep the data consistent. Therefore, we can update multiple metadata keys in a single transaction.
These table metadata are only updated in the following operations.
## Region Failover
It needs to update the `TableRoute` key and `DatanodeTable` keys. If the `TableRoute` equals the snapshot of `TableRoute` taken when submitting the failover task, then we can safely update these keys.
After a failover task is submitted, while it waits to acquire locks for execution, the `TableRoute` may be updated by another task. After acquiring the lock, we fetch the latest `TableRoute` again and then execute only if still needed.
## Create Table DDL
Creates all of the above keys. `TableRoute` and `TableInfo` should be empty beforehand.
The **TableNameKey**'s lock will be held by the procedure framework.
## Drop Table DDL
`TableInfoKey` and `NextTableRouteKey` will be added with `__removed-` prefix, and the other above keys will be deleted. The transaction will not compare any keys.
## Alter Table DDL
1. Rename table: updates `TableInfo` and `TableName`. The transaction compares `TableInfo`: the new `TableNameKey` should be empty, and `TableInfo` should equal the snapshot taken when submitting the DDL.
The old and new **TableNameKey** locks will be held by the procedure framework.
2. Alter table: updates `TableInfo`. `TableInfo` should equal the snapshot taken when submitting the DDL.
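A hedged sketch of the compare-and-put style transaction this implies, against a hypothetical key-value backend (the `Txn`/`Compare`/`Op` types below are illustrative, not the real meta backend API):
```rust
// Illustrative transaction builder for a metadata KV backend.
enum Compare {
    ValueEquals { key: Vec<u8>, expected: Vec<u8> },
    KeyMissing { key: Vec<u8> },
}

enum Op {
    Put { key: Vec<u8>, value: Vec<u8> },
}

struct Txn {
    compares: Vec<Compare>, // all must hold, otherwise nothing is applied
    success_ops: Vec<Op>,   // applied atomically when the compares hold
}

/// Rename-table style update: both keys change in one atomic transaction,
/// guarded by the snapshot taken when the DDL was submitted.
fn rename_table_txn(
    table_info_key: Vec<u8>,
    table_info_snapshot: Vec<u8>,
    new_table_info: Vec<u8>,
    new_table_name_key: Vec<u8>,
    table_id: Vec<u8>,
) -> Txn {
    Txn {
        compares: vec![
            Compare::ValueEquals { key: table_info_key.clone(), expected: table_info_snapshot },
            Compare::KeyMissing { key: new_table_name_key.clone() },
        ],
        success_ops: vec![
            Op::Put { key: table_info_key, value: new_table_info },
            Op::Put { key: new_table_name_key, value: table_id },
        ],
    }
}
```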
This RFC proposes an optimization to the storage engine: an inverted index aimed at speeding up label selection queries on metrics, with tag columns as the optimization target.
# Introduction
In the current setup of the Mito engine, the first column of the primary key has a min-max index, which significantly improves query performance. However, there are limitations when it comes to other columns, primarily tags. This RFC suggests implementing an inverted index to provide enhanced filtering, bridge these limitations, and improve overall system performance.
# Design Detail
## Inverted Index
The primary aim of the proposed inverted index is to optimize tag columns in the SST Parquet Files within the Mito Engine. The mapping and construction of an inverted index, from Tag Values to Row Groups, enables efficient logical structures that provide faster and more flexible queries.
When scanning SST files, pushed-down filters are applied to the respective tag's inverted index to determine the final row groups to be scanned, further bolstering the speed and efficiency of data retrieval.
## Index Format
The Inverted Index for each SST file follows the format shown below:
The `footer_payload` is the protobuf encoding of `InvertedIndexFooter`.
The complete format is containerized in [Puffin](https://iceberg.apache.org/puffin-spec/) with the type defined as `greptime-inverted-index-v1`.
## Protobuf Details
The `InvertedIndexFooter` is defined in the following protobuf structure:
```protobuf
message InvertedIndexFooter {
  repeated InvertedIndexMeta metas = 1;
}

message InvertedIndexMeta {
  string name = 1;
  uint64 row_count_in_group = 2;
  uint64 fst_offset = 3;
  uint64 fst_size = 4;
  uint64 null_bitmap_offset = 5;
  uint64 null_bitmap_size = 6;
  InvertedIndexStats stats = 7;
}

message InvertedIndexStats {
  uint64 null_count = 1;
  uint64 distinct_count = 2;
  bytes min_value = 3;
  bytes max_value = 4;
}
```
## Bitmap
Bitmaps are used to represent indices of fixed-size groups. Rows are divided into groups of a fixed size, defined in the `InvertedIndexMeta` as `row_count_in_group`.
For example, when `row_count_in_group` is `4096`, it means each group has `4096` rows. If there are a total of `10000` rows, there will be `3` groups in total. The first two groups will have `4096` rows each, and the last group will have `1808` rows. If the indexed values are found in row `200` and `9000`, they will correspond to groups `0` and `2`, respectively. Therefore, the bitmap should show `0` and `2`.
Bitmap is implemented using [BitVec](https://docs.rs/bitvec/latest/bitvec/), selected due to its efficient representation of dense data arrays typical of indices of groups.
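A small sketch of setting group bits with `bitvec`, reusing the numbers from the example above (the helper name is illustrative):
```rust
use bitvec::prelude::*;

/// Mark the groups that contain the given row indices.
fn group_bitmap(row_indices: &[usize], total_rows: usize, row_count_in_group: usize) -> BitVec {
    let group_count = (total_rows + row_count_in_group - 1) / row_count_in_group;
    let mut bitmap = bitvec![0; group_count];
    for &row in row_indices {
        bitmap.set(row / row_count_in_group, true);
    }
    bitmap
}

fn main() {
    // 10000 rows, groups of 4096 -> 3 groups; rows 200 and 9000 hit groups 0 and 2.
    let bm = group_bitmap(&[200, 9000], 10_000, 4096);
    assert_eq!(bm.iter_ones().collect::<Vec<_>>(), vec![0, 2]);
}
```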
## Finite State Transducer (FST)
[FST](https://docs.rs/fst/latest/fst/) is a highly efficient data structure ideal for in-memory indexing. It represents ordered sets or maps where the keys are bytes. The choice of the FST effectively balances the need for performance, space efficiency, and the ability to perform complex analyses such as regular expression matching.
The conventional usage of FST maps keys to `u64` values; we adapt it to index row groups indirectly. As the row groups are represented as bitmaps, we split the `u64` value into the bitmap's offset (higher 32 bits) and size (lower 32 bits) to represent the location of each bitmap.
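A sketch of that packing with the `fst` crate (the tag values and offsets are illustrative; error handling is trimmed):
```rust
use fst::{Map, MapBuilder};

/// Pack a bitmap's offset (high 32 bits) and size (low 32 bits) into the
/// u64 value stored in the FST.
fn pack(offset: u32, size: u32) -> u64 {
    ((offset as u64) << 32) | size as u64
}

fn unpack(value: u64) -> (u32, u32) {
    ((value >> 32) as u32, value as u32)
}

fn main() -> Result<(), fst::Error> {
    // Keys (tag values) must be inserted in sorted order.
    let mut builder = MapBuilder::memory();
    builder.insert("idc-a", pack(0, 16))?;
    builder.insert("idc-b", pack(16, 16))?;
    let map: Map<Vec<u8>> = builder.into_map();

    if let Some(value) = map.get("idc-b") {
        let (offset, size) = unpack(value);
        assert_eq!((offset, size), (16, 16));
    }
    Ok(())
}
```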
## API Design
Two APIs are designed: `InvertedIndexBuilder` for building indexes and `InvertedIndexSearcher` for querying them.
**Only the red nodes will persist state after they have succeeded**; other nodes won't persist state (excluding the Start and End nodes).
## Steps
**The persistent context:** It's shared in each step and available after recovering. It will only be updated/stored after the Red node has succeeded.
Values:
- `region_id`: The target leader region.
- `peer`: The target datanode.
- `close_old_leader`: Indicates whether to close the old leader region.
- `leader_may_unreachable`: Used to support the failover procedure.
**The Volatile context:** It's shared in each step and available in executing (including retrying). It will be dropped if the procedure runner crashes.
### Select Candidate
The Persistent state: Selected Candidate Region.
### Update Metadata(Down)
**The Persistent context:**
- The (latest/updated) `version` of `TableRouteValue`, It will be used in the step of `Update Metadata(Up)`.
### Downgrade Leader
This step sends an instruction via heartbeat and performs:
1. Downgrades leader region.
2. Retrieves the `last_entry_id` (if available).
If the target leader region is not found:
- Sets `close_old_leader` to true.
- Sets `leader_may_unreachable` to true.
If the target Datanode is unreachable:
- Waits for the region lease to expire.
- Sets `close_old_leader` to true.
- Sets `leader_may_unreachable` to true.
**The Persistent context:**
None
**The Persistent state:**
- `last_entry_id` (passed to the next step)
### Upgrade Candidate
This step sends an instruction via heartbeat and performs:
1. Replays the WAL to the latest entry (`last_entry_id`).
2. Upgrades the candidate region.
If the target region is not found:
- Rolls back.
- Notifies the failover detector if `leader_may_unreachable` == true.
- Exits procedure.
If the target Datanode is unreachable:
- Rolls back.
- Notifies the failover detector if `leader_may_unreachable` == true.
- Exits procedure.
**The Persistent context:**
None
### Update Metadata(Up)
This step performs
1. Switches Leader.
2. Removes Old Leader(Opt.).
3. Moves Old Leader to follower(Opt.).
The `TableRouteValue` version should equal the `TableRouteValue` `version` in the persistent context. Otherwise, it verifies whether the `TableRouteValue` has already been updated.
**The Persistent context:**
None
### Close Old Leader(Opt.)
This step sends a close region instruction via heartbeat.
If the target leader region is not found:
- Ignore.
If the target Datanode is unreachable:
- Ignore.
### Open Candidate(Opt.)
This step sends an open region instruction via heartbeat and waits for conditions to be met (typically, the condition is that the `last_entry_id` of the Candidate Region is very close to that of the Leader Region or the latest).
The `datatypes` crate defines the elementary schema structs to describe the metadata.
## ColumnSchema
[ColumnSchema](https://github.com/GreptimeTeam/greptimedb/blob/9fa871a3fad07f583dc1863a509414da393747f8/src/datatypes/src/schema/column_schema.rs#L36) represents the metadata of a column. It is equivalent to arrow's [Field](https://docs.rs/arrow/latest/arrow/datatypes/struct.Field.html) with additional metadata such as default constraint and whether the column is a time index. The time index is the column with a `TIME INDEX` constraint of a table. We can convert the `ColumnSchema` into an arrow `Field` and convert the `Field` back to the `ColumnSchema` without losing metadata.
## Schema
[Schema](https://github.com/GreptimeTeam/greptimedb/blob/9fa871a3fad07f583dc1863a509414da393747f8/src/datatypes/src/schema.rs#L38) is an ordered sequence of `ColumnSchema`. It is equivalent to arrow's [Schema](https://docs.rs/arrow/latest/arrow/datatypes/struct.Schema.html) with additional metadata including the index of the time index column and the version of this schema. Same as `ColumnSchema`, we can convert our `Schema` from/to arrow's `Schema`.
```rust
use arrow::datatypes::Schema as ArrowSchema;

pub struct Schema {
    column_schemas: Vec<ColumnSchema>,
    name_to_index: HashMap<String, usize>,
    arrow_schema: Arc<ArrowSchema>,
    timestamp_index: Option<usize>,
    version: u32,
}

pub type SchemaRef = Arc<Schema>;
```
We alias `Arc<Schema>` as `SchemaRef` since it is used frequently. Mostly, we use our `ColumnSchema` and `Schema` structs instead of Arrow's `Field` and `Schema` unless we need to invoke third-party libraries (like DataFusion or ArrowFlight) that rely on Arrow.
## RawSchema
`Schema` contains fields like a map from column names to their indices in the `ColumnSchema` sequence and a cached arrow `Schema`. We can construct these fields from the `ColumnSchema` sequence, thus we don't want to serialize them. This is why we don't derive `Serialize` and `Deserialize` for `Schema`. We introduce a new struct, [RawSchema](https://github.com/GreptimeTeam/greptimedb/blob/9fa871a3fad07f583dc1863a509414da393747f8/src/datatypes/src/schema/raw.rs#L24), which keeps all the required fields of a `Schema` and derives the serialization traits. To serialize a `Schema`, we need to convert it into a `RawSchema` first and serialize the `RawSchema`.
```rust
pub struct RawSchema {
    pub column_schemas: Vec<ColumnSchema>,
    pub timestamp_index: Option<usize>,
    pub version: u32,
}
```
We want to keep the `Schema` simple and avoid putting too much business-related metadata in it as many different structs or traits rely on it.
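The same pattern in miniature, with toy stand-ins instead of the real `Schema`/`RawSchema`, to show why the derived cache is rebuilt rather than serialized:
```rust
use serde::{Deserialize, Serialize};
use std::collections::HashMap;

// Toy stand-ins for Schema/RawSchema, illustrating the pattern only.
struct FullSchema {
    column_names: Vec<String>,
    name_to_index: HashMap<String, usize>, // derived cache, not serialized
    version: u32,
}

#[derive(Serialize, Deserialize)]
struct RawSchemaToy {
    column_names: Vec<String>,
    version: u32,
}

impl From<&FullSchema> for RawSchemaToy {
    fn from(s: &FullSchema) -> Self {
        // Keep only the essential fields for persistence.
        RawSchemaToy { column_names: s.column_names.clone(), version: s.version }
    }
}

impl From<RawSchemaToy> for FullSchema {
    fn from(raw: RawSchemaToy) -> Self {
        // Rebuild the derived cache instead of persisting it.
        let name_to_index = raw
            .column_names
            .iter()
            .enumerate()
            .map(|(i, n)| (n.clone(), i))
            .collect();
        FullSchema { column_names: raw.column_names, name_to_index, version: raw.version }
    }
}
```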
# Schema of the Table
A table maintains its schema in [TableMeta](https://github.com/GreptimeTeam/greptimedb/blob/9fa871a3fad07f583dc1863a509414da393747f8/src/table/src/metadata.rs#L97).
```rust
pub struct TableMeta {
    pub schema: SchemaRef,
    pub primary_key_indices: Vec<usize>,
    pub value_indices: Vec<usize>,
    // ...
}
```
The order of columns in `TableMeta::schema` is the same as the order specified in the `CREATE TABLE` statement which users use to create this table.
The field `primary_key_indices` stores the indices of primary key columns. The field `value_indices` records the indices of value columns (columns that are neither primary key nor time index; we sometimes call them field columns).
We split a table into one or more units with the same schema and then store these units in the storage engine. Each unit is a region in the storage engine.
The storage engine maintains schemas of regions in more complicated ways because it
- adds internal columns that are invisible to users to store additional metadata for each row
- provides a data model similar to the key-value model so it organizes columns in a different order
- maintains additional metadata like column id or column family
So the storage engine defines several schema structs:
- RegionSchema
- StoreSchema
- ProjectedSchema
## RegionSchema
A [RegionSchema](https://github.com/GreptimeTeam/greptimedb/blob/9fa871a3fad07f583dc1863a509414da393747f8/src/storage/src/schema/region.rs#L37) describes the schema of a region.
```rust
pub struct RegionSchema {
    user_schema: SchemaRef,
    store_schema: StoreSchemaRef,
    columns: ColumnsMetadataRef,
}
```
Each region reserves some columns called `internal columns` for internal usage:
- `__sequence`, sequence number of a row
- `__op_type`, operation type of a row, such as `PUT` or `DELETE`
- `__version`, user-specified version of a row, reserved but not used. We might remove this in the future
The table engine can't see the `__sequence` and `__op_type` columns, so the `RegionSchema` itself maintains two internal schemas:
- User schema, a `Schema` struct that doesn't have internal columns
- Store schema, a `StoreSchema` struct that has internal columns
The `ColumnsMetadata` struct keeps metadata about all columns but most time we only need to use metadata in user schema and store schema, so we just ignore it. We may remove this struct in the future.
`RegionSchema` organizes columns in the following order:
```
key columns, timestamp, [__version,] value columns, __sequence, __op_type
```
We can ignore the `__version` column because it is disabled now:
```
key columns, timestamp, value columns, __sequence, __op_type
```
Key columns are columns of a table's primary key. Timestamp is the time index column. A region sorts all rows by key columns, timestamp, sequence, and op type.
So the `RegionSchema` of our `cpu` table above looks like this:
```json
{
"user_schema":[
"datacenter",
"host",
"ts",
"usage_user",
"usage_system"
],
"store_schema":[
"datacenter",
"host",
"ts",
"usage_user",
"usage_system",
"__sequence",
"__op_type"
]
}
```
## StoreSchema
As described above, a [StoreSchema](https://github.com/GreptimeTeam/greptimedb/blob/9fa871a3fad07f583dc1863a509414da393747f8/src/storage/src/schema/store.rs#L36) is a schema that knows all internal columns.
```rust
struct StoreSchema {
    columns: Vec<ColumnMetadata>,
    schema: SchemaRef,
    row_key_end: usize,
    user_column_end: usize,
}
```
The columns in the `columns` and `schema` fields are in the same order. `ColumnMetadata` holds metadata such as the column id, column family id, and comment. The `StoreSchema` also stores this metadata in `StoreSchema::schema`, so we can convert between a `StoreSchema` and arrow's `Schema`. We use this feature to persist the `StoreSchema` in SST files, since our SST format is Parquet, which can take arrow's `Schema` as its schema.
The `StoreSchema` of the region above is similar to this:
```json
{
"schema":{
"column_schemas":[
"datacenter",
"host",
"ts",
"usage_user",
"usage_system",
"__sequence",
"__op_type"
],
"time_index":2,
"version":0
},
"row_key_end":3,
"user_column_end":5
}
```
The key columns and the timestamp column together form the row key of a row. We place them contiguously at the front so we can use `row_key_end` to get the indices of all row key columns. Similarly, `user_column_end` gives the indices of all user columns (non-internal columns).
Another useful property of `StoreSchema` is that it always contains the key columns, the timestamp column, and the internal columns, because we need them to perform merge, deduplication, and deletion. Projection on a `StoreSchema` only projects value columns.
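Since Parquet can carry an arrow `Schema` together with arbitrary key-value metadata, the extra `StoreSchema` fields (`row_key_end`, `user_column_end`, schema version) can ride along in that metadata map. The snippet below only sketches the idea with the arrow crate; the metadata key names are made up and are not the ones GreptimeDB actually uses.
```rust
use std::collections::HashMap;

use arrow::datatypes::{DataType, Field, Schema, TimeUnit};

// Sketch: put the StoreSchema extras into schema-level metadata so they
// survive a round trip through a Parquet file. Key names are hypothetical.
fn store_schema_to_arrow() -> Schema {
    let fields = vec![
        Field::new("datacenter", DataType::Utf8, true),
        Field::new("host", DataType::Utf8, true),
        Field::new("ts", DataType::Timestamp(TimeUnit::Millisecond, None), false),
        Field::new("usage_user", DataType::Float64, true),
        Field::new("usage_system", DataType::Float64, true),
        Field::new("__sequence", DataType::UInt64, false),
        Field::new("__op_type", DataType::UInt8, false),
    ];
    let metadata = HashMap::from([
        ("storage.row_key_end".to_string(), "3".to_string()),
        ("storage.user_column_end".to_string(), "5".to_string()),
        ("storage.schema_version".to_string(), "0".to_string()),
    ]);
    Schema::new_with_metadata(fields, metadata)
}
```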
## ProjectedSchema
To support arbitrary projection, we introduce the [ProjectedSchema](https://github.com/GreptimeTeam/greptimedb/blob/9fa871a3fad07f583dc1863a509414da393747f8/src/storage/src/schema/projected.rs#L106).
```rust
pub struct ProjectedSchema {
    projection: Option<Projection>,
    schema_to_read: StoreSchemaRef,
    projected_user_schema: SchemaRef,
}
```
We need to handle many cases while doing projection:
- The column order of the table and the region is different
- The projection can be in arbitrary order, e.g. `select usage_user, host from cpu` and `select host, usage_user from cpu` have different projection order
- We support `ALTER TABLE` so data files may have different schemas.
### Projection
Let's take an example to see how projection works. Suppose we want to select `ts` and `usage_system` from the `cpu` table.
The query engine uses the projection `[0, 3]` to scan the table. However, the columns in the region are stored in a different order, so the table engine adjusts the projection to `[2, 4]` according to the region's column order:
```json
{
"user_schema":[
"datacenter",
"host",
"ts",
"usage_user",
"usage_system"
]
}
```
Applying `[2, 4]` to the region's columns yields `[ts, usage_system]`, so the output order is still the one the query expects. This is the schema users see after projection, so we call it the `projected user schema`.
But the storage engine also needs to read the key columns, the timestamp column, and the internal columns, so the `ProjectedSchema` also maintains a projected `StoreSchema`.
The `Projection` struct is a helper struct to help compute the projected user schema and store schema.
So we can construct the following `ProjectedSchema`:
```json
{
"schema_to_read":{
"schema":{
"column_schemas":[
"datacenter",
"host",
"ts",
"usage_system",
"__sequence",
"__op_type"
],
"time_index":2,
"version":0
},
"row_key_end":3,
"user_column_end":4
},
"projected_user_schema":{
"column_schemas":[
"ts",
"usage_system"
],
"time_index":0
}
}
```
As you can see, `schema_to_read` doesn't contain the column `usage_user`, which is not in the projection and therefore doesn't need to be read.
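The index arithmetic above boils down to a name lookup from the table's column order into the region's column order. The sketch below is not the actual `Projection` helper, and the table column order it uses is hypothetical (only `ts` at index 0 and `usage_system` at index 3 are implied by the projection `[0, 3]` in the text).
```rust
// Sketch: translate a table-level projection into region-level column indices
// by matching column names.
fn to_region_projection(
    table_columns: &[&str],
    region_columns: &[&str],
    table_projection: &[usize],
) -> Vec<usize> {
    table_projection
        .iter()
        .map(|&i| {
            let name = table_columns[i];
            region_columns
                .iter()
                .position(|c| *c == name)
                .expect("projected column must exist in the region")
        })
        .collect()
}

fn main() {
    // Hypothetical table column order.
    let table = ["ts", "host", "usage_user", "usage_system", "datacenter"];
    let region = ["datacenter", "host", "ts", "usage_user", "usage_system"];
    assert_eq!(to_region_projection(&table, &region, &[0, 3]), vec![2, 4]);
}
```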
### ReadAdapter
As mentioned above, we can alter a table so the underlying files (SSTs) and memtables in the storage engine may have different schemas.
To simplify the logic of `ProjectedSchema`, we handle the difference between schemas before projection, i.e. before constructing the `ProjectedSchema`. We introduce the [ReadAdapter](https://github.com/GreptimeTeam/greptimedb/blob/9fa871a3fad07f583dc1863a509414da393747f8/src/storage/src/schema/compat.rs#L90), which adapts rows with different source schemas to the same expected schema.
So we can always use the current `RegionSchema` of the region to construct the `ProjectedSchema`, and then create a `ReadAdapter` for each memtable or SST.
```rust
#[derive(Debug)]
pub struct ReadAdapter {
    source_schema: StoreSchemaRef,
    dest_schema: ProjectedSchemaRef,
    indices_in_result: Vec<Option<usize>>,
    is_source_needed: Vec<bool>,
}
```
For each column required by `dest_schema`, `indices_in_result` stores the index of that column in the row read from the source memtable or SST (the read row contains only the source columns that are needed). If the source row doesn't contain that column, the index is `None`.
The field `is_source_needed` stores whether a column in the source memtable or SST is needed.
Suppose we add a new column `usage_idle` to the table `cpu`.
```sql
ALTER TABLE cpu ADD COLUMN usage_idle DOUBLE;
```
The new `StoreSchema` becomes:
```json
{
"schema":{
"column_schemas":[
"datacenter",
"host",
"ts",
"usage_user",
"usage_system",
"usage_idle",
"__sequence",
"__op_type"
],
"time_index":2,
"version":1
},
"row_key_end":3,
"user_column_end":6
}
```
Note that we bump the version of the schema to 1.
Suppose we want to select `ts`, `usage_system`, and `usage_idle`. While reading from a memtable or SST that still uses the old schema (version 0), the storage engine creates a `ReadAdapter` like this:
```json
{
"source_schema":{
"schema":{
"column_schemas":[
"datacenter",
"host",
"ts",
"usage_user",
"usage_system",
"__sequence",
"__op_type"
],
"time_index":2,
"version":0
},
"row_key_end":3,
"user_column_end":5
},
"dest_schema":{
"schema_to_read":{
"schema":{
"column_schemas":[
"datacenter",
"host",
"ts",
"usage_system",
"usage_idle",
"__sequence",
"__op_type"
],
"time_index":2,
"version":1
},
"row_key_end":3,
"user_column_end":5
},
"projected_user_schema":{
"column_schemas":[
"ts",
"usage_system",
"usage_idle"
],
"time_index":0
}
},
"indices_in_result":[
0,
1,
2,
3,
null,
4,
5
],
"is_source_needed":[
true,
true,
true,
false,
true,
true,
true
]
}
```
We don't need to read `usage_user`, so `is_source_needed[3]` is `false`. The old schema doesn't have the column `usage_idle`, so `indices_in_result[4]` is `null`, and the `ReadAdapter` inserts a null column into the output row so that the output schema still contains `usage_idle`.
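The mapping itself is again a name lookup, this time from the destination columns into the columns actually read from the source (i.e. the source columns with the unneeded ones already dropped). A minimal sketch, not the actual `ReadAdapter` code:
```rust
// Sketch: for each destination column, find its position in the row read from
// the source; None means the older source schema lacks the column and a null
// column must be emitted in its place.
fn build_indices_in_result(dest_columns: &[&str], read_columns: &[&str]) -> Vec<Option<usize>> {
    dest_columns
        .iter()
        .map(|dest| read_columns.iter().position(|src| src == dest))
        .collect()
}

fn main() {
    // Columns actually read from the old (version 0) SST, with `usage_user`
    // already filtered out by `is_source_needed`.
    let read = ["datacenter", "host", "ts", "usage_system", "__sequence", "__op_type"];
    // Columns required by the projected destination schema (version 1).
    let dest = ["datacenter", "host", "ts", "usage_system", "usage_idle", "__sequence", "__op_type"];
    assert_eq!(
        build_indices_in_result(&dest, &read),
        vec![Some(0), Some(1), Some(2), Some(3), None, Some(4), Some(5)]
    );
}
```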
The figure below shows the relationship between `RegionSchema`, `StoreSchema`, `ProjectedSchema`, and `ReadAdapter`.
```text
┌──────────────────────────────┐
│ │
│ ┌────────────────────┐ │
│ │ store_schema │ │
│ │ │ │
│ │ StoreSchema │ │
│ │ version 1 │ │
│ └────────────────────┘ │
│ │
│ ┌────────────────────┐ │
│ │ user_schema │ │
│ └────────────────────┘ │
│ │
│ RegionSchema │
│ │
└──────────────┬───────────────┘
│
│
│
┌──────────────▼───────────────┐
│ │
│ ┌──────────────────────────┐ │
│ │ schema_to_read │ │
│ │ │ │
│ │ StoreSchema (projected) │ │
│ │ version 1 │ │
│ └──────────────────────────┘ │
┌───┤ ├───┐
│ │ ┌──────────────────────────┐ │ │
│ │ │ projected_user_schema │ │ │
│ │ └──────────────────────────┘ │ │
│ │ │ │
│ │ ProjectedSchema │ │
dest schema │ └──────────────────────────────┘ │ dest schema
```
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
syntax = "proto3";
package prometheus;

option go_package = "prompb";

message MetricMetadata {
  enum MetricType {
    UNKNOWN        = 0;
    COUNTER        = 1;
    GAUGE          = 2;
    HISTOGRAM      = 3;
    GAUGEHISTOGRAM = 4;
    SUMMARY        = 5;
    INFO           = 6;
    STATESET       = 7;
  }

  // Represents the metric type, these match the set from Prometheus.
  // Refer to model/textparse/interface.go for details.
  MetricType type = 1;
  string metric_family_name = 2;
  string help = 4;
  string unit = 5;
}

message Sample {
  double value = 1;
  // timestamp is in ms format, see model/timestamp/timestamp.go for
  // conversion from time.Time to Prometheus timestamp.
  int64 timestamp = 2;
}

message Exemplar {
  // Optional, can be empty.
  repeated Label labels = 1;
  double value = 2;
  // timestamp is in ms format, see model/timestamp/timestamp.go for
  // conversion from time.Time to Prometheus timestamp.
  int64 timestamp = 3;
}

// TimeSeries represents samples and labels for a single time series.
message TimeSeries {
  // For a timeseries to be valid, and for the samples and exemplars
  // to be ingested by the remote system properly, the labels field is required.
  repeated Label labels = 1;
  repeated Sample samples = 2;
  repeated Exemplar exemplars = 3;
}

message Label {
  string name = 1;
  string value = 2;
}

message Labels {
  repeated Label labels = 1;
}

// Matcher specifies a rule, which can match or set of labels or not.
message LabelMatcher {
  enum Type {
    EQ  = 0;
    NEQ = 1;
    RE  = 2;
    NRE = 3;
  }
  Type type = 1;
  string name = 2;
  string value = 3;
}

message ReadHints {
  int64 step_ms = 1;             // Query step size in milliseconds.
  string func = 2;               // String representation of surrounding function or aggregation.
  int64 start_ms = 3;            // Start time in milliseconds.
  int64 end_ms = 4;              // End time in milliseconds.
  repeated string grouping = 5;  // List of label names used in aggregation.
  bool by = 6;                   // Indicate whether it is without or by.
  int64 range_ms = 7;            // Range vector selector range in milliseconds.
}

// Chunk represents a TSDB chunk.
// Time range [min, max] is inclusive.
message Chunk {
  int64 min_time_ms = 1;
  int64 max_time_ms = 2;

  // We require this to match chunkenc.Encoding.
  enum Encoding {
    UNKNOWN = 0;
    XOR     = 1;
  }
  Encoding type = 3;
  bytes data = 4;
}

// ChunkedSeries represents single, encoded time series.
message ChunkedSeries {
  // Labels should be sorted.
  repeated Label labels = 1;
  // Chunks will be in start time order and may overlap.
  repeated Chunk chunks = 2;
}