Compare commits


20 Commits

Author SHA1 Message Date
discord9
334dbee590 ci: upload .pdb files too 2024-12-25 11:31:55 +08:00
Weny Xu
f33b378e45 chore: add log for converting region to follower (#5222)
* chore: add log for converting region to follower

* chore: apply suggestions from CR
2024-12-25 02:38:47 +00:00
zyy17
267941bbb5 ci: support to pack multiple files in upload-artifacts action (#5228) 2024-12-25 02:37:32 +00:00
Lei, HUANG
074846bbc2 feat(mito): parquet memtable reader (#4967)
* wip: row group reader base

* wip: memtable row group reader

* Refactor MemtableRowGroupReader to streamline data fetching

 - Added early return when fetch_ranges is empty to optimize performance.
 - Replaced inline chunk data assignment with a call to `assign_dense_chunk` for cleaner code.

* wip: row group reader

* wip: reuse RowGroupReader

* wip: bulk part reader

* Enhance BulkPart Iteration with Filtering

 - Introduced `RangeBase` to `BulkIterContext` for improved filter handling.
 - Implemented filter application in `BulkPartIter` to prune batches based on predicates.
 - Updated `SimpleFilterContext::new_opt` to be public for broader access.

* chore: add prune test

* fix: clippy

* fix: introduce prune reader for memtable and add more prune test

* Enhance BulkPart read method to return Option<BoxedBatchIterator>

 - Modified `BulkPart::read` to return `Option<BoxedBatchIterator>` to handle cases where no row groups are selected.
 - Added logic to return `None` when all row groups are filtered out.
 - Updated tests to handle the new return type and added a test case to verify behavior when no row groups match the predicate (sketched below)

* refactor/separate-paraquet-reader: Add helper function to parse parquet metadata and integrate it into BulkPartEncoder

* refactor/separate-paraquet-reader:
 Change BulkPartEncoder row_group_size from Option to usize and update tests

* refactor/separate-paraquet-reader: Add context module for bulk memtable iteration and refactor part reading

 • Introduce context module to encapsulate context for bulk memtable iteration.
 • Refactor BulkPart to use BulkIterContextRef for reading operations.
 • Remove redundant code in BulkPart by centralizing context creation and row group pruning logic in the new context module.
 • Create new file context.rs with structures and logic for handling iteration context.
 • Adjust part_reader.rs and row_group_reader.rs to reference the new BulkIterContextRef.

* refactor/separate-paraquet-reader: Refactor RowGroupReader traits and implementations in memtable and parquet reader modules

 • Rename RowGroupReaderVirtual to RowGroupReaderContext for clarity.
 • Replace BulkPartVirt with direct usage of BulkIterContextRef in MemtableRowGroupReader.
 • Simplify MemtableRowGroupReaderBuilder by directly passing context instead of creating a BulkPartVirt instance.
 • Update RowGroupReaderBase to use context field instead of virt, reflecting the trait renaming and usage.
 • Modify FileRangeVirt to FileRangeContextRef and adjust implementations accordingly.

* refactor/separate-paraquet-reader: Refactor column page reader creation and remove unused code

 • Centralize creation of SerializedPageReader in RowGroupBase::column_reader method.
 • Remove unused RowGroupCachedReader and related code from MemtableRowGroupPageFetcher.
 • Eliminate redundant error handling for invalid column index in multiple places.

* chore: rebase main and resolve conflicts

* fix: some comments

* chore: resolve conflicts

* chore: resolve conflicts
2024-12-24 09:59:26 +00:00
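
The bullets above describe `BulkPart::read` now returning `Option<BoxedBatchIterator>` so that fully pruned parts skip reader construction. Below is a minimal sketch of that shape, using names taken from the commit message only; the actual mito2 signatures may differ.

```rust
// Hedged sketch: `BulkPart`, `Batch`, and `BoxedBatchIterator` are stand-ins named after
// the commit message, not the real mito2 types.
struct Batch;
type BoxedBatchIterator = Box<dyn Iterator<Item = Batch> + Send>;

struct BulkPart; // encoded parquet bytes and metadata omitted in this sketch

impl BulkPart {
    /// Returns `None` when predicate pruning filters out every row group,
    /// so callers can skip building a reader entirely.
    fn read(&self, selected_row_groups: &[usize]) -> Option<BoxedBatchIterator> {
        if selected_row_groups.is_empty() {
            return None;
        }
        // The real implementation builds a row-group reader over the selection here.
        Some(Box::new(std::iter::empty()))
    }
}

fn main() {
    let part = BulkPart;
    assert!(part.read(&[]).is_none());
    assert!(part.read(&[0]).is_some());
}
```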
Ruihang Xia
88d46a38ae chore: bump opendal to fork version to fix prometheus layer (#5223)
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
2024-12-24 08:54:59 +00:00
Weny Xu
de0beabf34 refactor: remove unnecessary wrap (#5221)
* chore: remove unnecessary arc

* chore: remove unnecessary box
2024-12-24 08:43:14 +00:00
Ruihang Xia
68dd2916fb feat: logs query endpoint (#5202)
* define endpoint

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

* planner

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

* update lock file

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

* add unit test

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

* fix toml format

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

* revert metric change

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

* Update src/query/src/log_query/planner.rs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix compile

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

* refactor and tests

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>

---------

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2024-12-24 06:21:19 +00:00
Zhenchi
d51b65a8bf feat(index-cache): abstract IndexCache to be shared by multi types of indexes (#5219)
* feat(index-cache): abstract `IndexCache` to be shared by multi types of indexes

Signed-off-by: Zhenchi <zhongzc_arch@outlook.com>

* fix typo

Signed-off-by: Zhenchi <zhongzc_arch@outlook.com>

* fix: remove added label

Signed-off-by: Zhenchi <zhongzc_arch@outlook.com>

* refactor: simplify cached reader impl

Signed-off-by: Zhenchi <zhongzc_arch@outlook.com>

* rename func

Signed-off-by: Zhenchi <zhongzc_arch@outlook.com>

---------

Signed-off-by: Zhenchi <zhongzc_arch@outlook.com>
2024-12-24 05:10:30 +00:00
zyy17
2082c4b6e4 docs: add greptimedb-operator project link in 'Tools & Extensions' and other small improvements (#5216) 2024-12-24 03:09:41 +00:00
Ning Sun
c623404fff ci: fix nightly ci task on nix build (#5198) 2024-12-21 10:09:32 +00:00
Yingwen
fa3b7ed5ea build: use 8xlarge as arm default (#5214) 2024-12-21 08:39:24 +00:00
Yiran
8ece853076 fix: dead links (#5212) 2024-12-20 12:01:57 +00:00
Zhenchi
4245bff8f2 feat(bloom-filter): add bloom filter reader (#5204)
* feat(bloom-filter): add bloom filter reader

Signed-off-by: Zhenchi <zhongzc_arch@outlook.com>

* chore: remove unused dep

Signed-off-by: Zhenchi <zhongzc_arch@outlook.com>

* fix conflict

Signed-off-by: Zhenchi <zhongzc_arch@outlook.com>

* address comments

Signed-off-by: Zhenchi <zhongzc_arch@outlook.com>

---------

Signed-off-by: Zhenchi <zhongzc_arch@outlook.com>
2024-12-20 08:29:18 +00:00
Zhenchi
3d4121aefb feat(bloom-filter): add memory control for creator (#5185)
* feat(bloom-filter): add memory control for creator

Signed-off-by: Zhenchi <zhongzc_arch@outlook.com>

* refactor: remove meaningless buf

Signed-off-by: Zhenchi <zhongzc_arch@outlook.com>

* feat: add codec for intermediate

Signed-off-by: Zhenchi <zhongzc_arch@outlook.com>

---------

Signed-off-by: Zhenchi <zhongzc_arch@outlook.com>
2024-12-20 06:59:44 +00:00
Weny Xu
1910d71cb3 chore: adjust fuzz tests cfg (#5207) 2024-12-20 06:58:51 +00:00
LFC
a578eea801 ci: install latest protobuf in dev-builder image (#5196) 2024-12-20 02:45:53 +00:00
discord9
6bf574f098 fix: auto created table ttl check (#5203)
* fix: auto created table ttl check

* tests: with hint
2024-12-19 11:23:01 +00:00
discord9
a4d61bcaf1 fix(flow): batch builder with type (#5195)
* fix: typed builder

* chore: clippy

* chore: rename

* fix: unit tests

* refactor: per review
2024-12-19 09:16:56 +00:00
dennis zhuang
7ea8a44d3a chore: update PR template (#5199) 2024-12-19 08:28:20 +00:00
discord9
2d6f63a504 feat: show flow's mem usage in INFORMATION_SCHEMA.FLOWS (#4890)
* feat: add flow mem size to sys table

* chore: rm dup def

* chore: remove unused variant

* chore: minor refactor

* refactor: per review
2024-12-19 08:24:04 +00:00
139 changed files with 5004 additions and 1132 deletions

View File

@@ -54,7 +54,7 @@ runs:
PROFILE_TARGET: ${{ inputs.cargo-profile == 'dev' && 'debug' || inputs.cargo-profile }}
with:
artifacts-dir: ${{ inputs.artifacts-dir }}
target-file: ./target/$PROFILE_TARGET/greptime
target-files: ./target/$PROFILE_TARGET/greptime
version: ${{ inputs.version }}
working-dir: ${{ inputs.working-dir }}
@@ -72,6 +72,6 @@ runs:
if: ${{ inputs.build-android-artifacts == 'true' }}
with:
artifacts-dir: ${{ inputs.artifacts-dir }}
target-file: ./target/aarch64-linux-android/release/greptime
target-files: ./target/aarch64-linux-android/release/greptime
version: ${{ inputs.version }}
working-dir: ${{ inputs.working-dir }}

View File

@@ -90,5 +90,5 @@ runs:
uses: ./.github/actions/upload-artifacts
with:
artifacts-dir: ${{ inputs.artifacts-dir }}
target-file: target/${{ inputs.arch }}/${{ inputs.cargo-profile }}/greptime
target-files: target/${{ inputs.arch }}/${{ inputs.cargo-profile }}/greptime
version: ${{ inputs.version }}

View File

@@ -76,5 +76,5 @@ runs:
uses: ./.github/actions/upload-artifacts
with:
artifacts-dir: ${{ inputs.artifacts-dir }}
target-file: target/${{ inputs.arch }}/${{ inputs.cargo-profile }}/greptime
target-files: target/${{ inputs.arch }}/${{ inputs.cargo-profile }}/greptime,target/${{ inputs.arch }}/${{ inputs.cargo-profile }}/greptime.pdb
version: ${{ inputs.version }}

View File

@@ -5,7 +5,7 @@ meta:
[datanode]
[datanode.client]
timeout = "60s"
timeout = "120s"
datanode:
configData: |-
[runtime]
@@ -21,7 +21,7 @@ frontend:
global_rt_size = 4
[meta_client]
ddl_timeout = "60s"
ddl_timeout = "120s"
objectStorage:
s3:
bucket: default

View File

@@ -5,7 +5,7 @@ meta:
[datanode]
[datanode.client]
timeout = "60s"
timeout = "120s"
datanode:
configData: |-
[runtime]
@@ -17,7 +17,7 @@ frontend:
global_rt_size = 4
[meta_client]
ddl_timeout = "60s"
ddl_timeout = "120s"
objectStorage:
s3:
bucket: default

View File

@@ -11,7 +11,7 @@ meta:
[datanode]
[datanode.client]
timeout = "60s"
timeout = "120s"
datanode:
configData: |-
[runtime]
@@ -28,7 +28,7 @@ frontend:
global_rt_size = 4
[meta_client]
ddl_timeout = "60s"
ddl_timeout = "120s"
objectStorage:
s3:
bucket: default

View File

@@ -4,8 +4,8 @@ inputs:
artifacts-dir:
description: Directory to store artifacts
required: true
target-file:
description: The path of the target artifact
target-files:
description: The target files to upload, separated by commas
required: false
version:
description: Version of the artifact
@@ -18,12 +18,16 @@ runs:
using: composite
steps:
- name: Create artifacts directory
if: ${{ inputs.target-file != '' }}
if: ${{ inputs.target-files != '' }}
working-directory: ${{ inputs.working-dir }}
shell: bash
run: |
mkdir -p ${{ inputs.artifacts-dir }} && \
cp ${{ inputs.target-file }} ${{ inputs.artifacts-dir }}
set -e
mkdir -p ${{ inputs.artifacts-dir }}
IFS=',' read -ra FILES <<< "${{ inputs.target-files }}"
for file in "${FILES[@]}"; do
cp "$file" ${{ inputs.artifacts-dir }}/
done
# The compressed artifacts will use the following layout:
# greptime-linux-amd64-pyo3-v0.3.0sha256sum

View File

@@ -4,7 +4,8 @@ I hereby agree to the terms of the [GreptimeDB CLA](https://github.com/GreptimeT
## What's changed and what's your intention?
__!!! DO NOT LEAVE THIS BLOCK EMPTY !!!__
<!--
__!!! DO NOT LEAVE THIS BLOCK EMPTY !!!__
Please explain IN DETAIL what the changes are in this PR and why they are needed:
@@ -12,9 +13,14 @@ Please explain IN DETAIL what the changes are in this PR and why they are needed
- How does this PR work? Need a brief introduction for the changed logic (optional)
- Describe clearly one logical change and avoid lazy messages (optional)
- Describe any limitations of the current code (optional)
- Describe if this PR will break **API or data compatibility** (optional)
-->
## Checklist
## PR Checklist
Please convert it to a draft if some of the following conditions are not met.
- [ ] I have written the necessary rustdoc comments.
- [ ] I have added the necessary unit tests and integration tests.
- [ ] This PR requires documentation updates.
- [ ] API changes are backward compatible.
- [ ] Schema or data changes are backward compatible.

View File

@@ -117,7 +117,6 @@ jobs:
cleanbuild-linux-nix:
runs-on: ubuntu-latest-8-cores
timeout-minutes: 60
needs: [coverage, fmt, clippy, check]
steps:
- uses: actions/checkout@v4
- uses: cachix/install-nix-action@v27

View File

@@ -31,7 +31,7 @@ on:
linux_arm64_runner:
type: choice
description: The runner used to build linux-arm64 artifacts
default: ec2-c6g.4xlarge-arm64
default: ec2-c6g.8xlarge-arm64
options:
- ubuntu-2204-32-cores-arm
- ec2-c6g.xlarge-arm64 # 4C8G

Cargo.lock (generated, 147 changes)
View File

@@ -730,6 +730,36 @@ version = "1.1.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1505bd5d3d116872e7271a6d4e16d81d0c8570876c8de68093a09ac269d8aac0"
[[package]]
name = "attribute-derive"
version = "0.10.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f1800e974930e9079c965b9ffbcb6667a40401063a26396c7b4f15edc92da690"
dependencies = [
"attribute-derive-macro",
"derive-where",
"manyhow",
"proc-macro2",
"quote",
"syn 2.0.90",
]
[[package]]
name = "attribute-derive-macro"
version = "0.10.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5d908eb786ef94296bff86f90130b3b748b49401dc81fd2bb8b3dccd44cfacbd"
dependencies = [
"collection_literals",
"interpolator",
"manyhow",
"proc-macro-utils",
"proc-macro2",
"quote",
"quote-use",
"syn 2.0.90",
]
[[package]]
name = "atty"
version = "0.2.14"
@@ -1845,6 +1875,12 @@ dependencies = [
"tracing-appender",
]
[[package]]
name = "collection_literals"
version = "1.0.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "186dce98367766de751c42c4f03970fc60fc012296e706ccbb9d5df9b6c1e271"
[[package]]
name = "colorchoice"
version = "1.0.2"
@@ -3346,6 +3382,17 @@ dependencies = [
"syn 2.0.90",
]
[[package]]
name = "derive-where"
version = "1.2.7"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "62d671cc41a825ebabc75757b62d3d168c577f9149b2d49ece1dad1f72119d25"
dependencies = [
"proc-macro2",
"quote",
"syn 2.0.90",
]
[[package]]
name = "derive_arbitrary"
version = "1.3.2"
@@ -4011,6 +4058,8 @@ dependencies = [
"enum-as-inner",
"enum_dispatch",
"futures",
"get-size-derive2",
"get-size2",
"greptime-proto",
"hydroflow",
"itertools 0.10.5",
@@ -4103,6 +4152,7 @@ dependencies = [
"futures",
"humantime-serde",
"lazy_static",
"log-query",
"log-store",
"meta-client",
"opentelemetry-proto 0.5.0",
@@ -4415,6 +4465,23 @@ dependencies = [
"libm",
]
[[package]]
name = "get-size-derive2"
version = "0.1.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1fd26d3a97ea14d289c8b54180243ecfe465f3fa9c279a6336d7a003698fc39d"
dependencies = [
"attribute-derive",
"quote",
"syn 2.0.90",
]
[[package]]
name = "get-size2"
version = "0.1.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "159c430715e540d2198fa981d39cd45563ccc60900de187f5b152b33b1cb408e"
[[package]]
name = "gethostname"
version = "0.2.3"
@@ -5346,6 +5413,12 @@ version = "4.0.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0d762194228a2f1c11063e46e32e5acb96e66e906382b9eb5441f2e0504bbd5a"
[[package]]
name = "interpolator"
version = "0.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "71dd52191aae121e8611f1e8dc3e324dd0dd1dee1e6dd91d10ee07a3cfb4d9d8"
[[package]]
name = "inventory"
version = "0.3.15"
@@ -6050,6 +6123,7 @@ dependencies = [
"chrono",
"common-error",
"common-macro",
"serde",
"snafu 0.8.5",
"table",
]
@@ -6244,6 +6318,29 @@ dependencies = [
"libc",
]
[[package]]
name = "manyhow"
version = "0.11.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b33efb3ca6d3b07393750d4030418d594ab1139cee518f0dc88db70fec873587"
dependencies = [
"manyhow-macros",
"proc-macro2",
"quote",
"syn 2.0.90",
]
[[package]]
name = "manyhow-macros"
version = "0.11.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "46fce34d199b78b6e6073abf984c9cf5fd3e9330145a93ee0738a7443e371495"
dependencies = [
"proc-macro-utils",
"proc-macro2",
"quote",
]
[[package]]
name = "maplit"
version = "1.0.2"
@@ -7375,8 +7472,7 @@ checksum = "b410bbe7e14ab526a0e86877eb47c6996a2bd7746f027ba551028c925390e4e9"
[[package]]
name = "opendal"
version = "0.50.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "cb28bb6c64e116ceaf8dd4e87099d3cfea4a58e85e62b104fef74c91afba0f44"
source = "git+https://github.com/GreptimeTeam/opendal.git?rev=c82605177f2feec83e49dcaa537c505639d94024#c82605177f2feec83e49dcaa537c505639d94024"
dependencies = [
"anyhow",
"async-trait",
@@ -8065,7 +8161,7 @@ dependencies = [
"rand",
"ring 0.17.8",
"rust_decimal",
"thiserror 2.0.4",
"thiserror 2.0.6",
"tokio",
"tokio-rustls 0.26.0",
"tokio-util",
@@ -8528,6 +8624,17 @@ dependencies = [
"version_check",
]
[[package]]
name = "proc-macro-utils"
version = "0.10.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "eeaf08a13de400bc215877b5bdc088f241b12eb42f0a548d3390dc1c56bb7071"
dependencies = [
"proc-macro2",
"quote",
"smallvec",
]
[[package]]
name = "proc-macro2"
version = "1.0.92"
@@ -8992,6 +9099,7 @@ dependencies = [
"humantime",
"itertools 0.10.5",
"lazy_static",
"log-query",
"meter-core",
"meter-macros",
"num",
@@ -9107,6 +9215,28 @@ dependencies = [
"proc-macro2",
]
[[package]]
name = "quote-use"
version = "0.8.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9619db1197b497a36178cfc736dc96b271fe918875fbf1344c436a7e93d0321e"
dependencies = [
"quote",
"quote-use-macros",
]
[[package]]
name = "quote-use-macros"
version = "0.8.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "82ebfb7faafadc06a7ab141a6f67bcfb24cb8beb158c6fe933f2f035afa99f35"
dependencies = [
"proc-macro-utils",
"proc-macro2",
"quote",
"syn 2.0.90",
]
[[package]]
name = "radium"
version = "0.7.0"
@@ -10824,6 +10954,7 @@ dependencies = [
"json5",
"jsonb",
"lazy_static",
"log-query",
"loki-api",
"mime_guess",
"mysql_async",
@@ -12306,11 +12437,11 @@ dependencies = [
[[package]]
name = "thiserror"
version = "2.0.4"
version = "2.0.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "2f49a1853cf82743e3b7950f77e0f4d622ca36cf4317cba00c767838bac8d490"
checksum = "8fec2a1820ebd077e2b90c4df007bebf344cd394098a13c563957d0afc83ea47"
dependencies = [
"thiserror-impl 2.0.4",
"thiserror-impl 2.0.6",
]
[[package]]
@@ -12326,9 +12457,9 @@ dependencies = [
[[package]]
name = "thiserror-impl"
version = "2.0.4"
version = "2.0.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8381894bb3efe0c4acac3ded651301ceee58a15d47c2e34885ed1908ad667061"
checksum = "d65750cab40f4ff1929fb1ba509e9914eb756131cef4210da8d5d700d26f6312"
dependencies = [
"proc-macro2",
"quote",

View File

@@ -238,6 +238,7 @@ file-engine = { path = "src/file-engine" }
flow = { path = "src/flow" }
frontend = { path = "src/frontend", default-features = false }
index = { path = "src/index" }
log-query = { path = "src/log-query" }
log-store = { path = "src/log-store" }
meta-client = { path = "src/meta-client" }
meta-srv = { path = "src/meta-srv" }

View File

@@ -70,23 +70,23 @@ Our core developers have been building time-series data platforms for years. Bas
* **Unified Processing of Metrics, Logs, and Events**
GreptimeDB unifies time series data processing by treating all data - whether metrics, logs, or events - as timestamped events with context. Users can analyze this data using either [SQL](https://docs.greptime.com/user-guide/query-data/sql) or [PromQL](https://docs.greptime.com/user-guide/query-data/promql) and leverage stream processing ([Flow](https://docs.greptime.com/user-guide/continuous-aggregation/overview)) to enable continuous aggregation. [Read more](https://docs.greptime.com/user-guide/concepts/data-model).
GreptimeDB unifies time series data processing by treating all data - whether metrics, logs, or events - as timestamped events with context. Users can analyze this data using either [SQL](https://docs.greptime.com/user-guide/query-data/sql) or [PromQL](https://docs.greptime.com/user-guide/query-data/promql) and leverage stream processing ([Flow](https://docs.greptime.com/user-guide/flow-computation/overview)) to enable continuous aggregation. [Read more](https://docs.greptime.com/user-guide/concepts/data-model).
* **Cloud-native Distributed Database**
Built for [Kubernetes](https://docs.greptime.com/user-guide/deployments/deploy-on-kubernetes/greptimedb-operator-management). GreptimeDB achieves seamless scalability with its [cloud-native architecture](https://docs.greptime.com/user-guide/concepts/architecture) of separated compute and storage, built on object storage (AWS S3, Azure Blob Storage, etc.) while enabling cross-cloud deployment through a unified data access layer.
Built for [Kubernetes](https://docs.greptime.com/user-guide/deployments/deploy-on-kubernetes/greptimedb-operator-management). GreptimeDB achieves seamless scalability with its [cloud-native architecture](https://docs.greptime.com/user-guide/concepts/architecture) of separated compute and storage, built on object storage (AWS S3, Azure Blob Storage, etc.) while enabling cross-cloud deployment through a unified data access layer.
* **Performance and Cost-effective**
Written in pure Rust for superior performance and reliability. GreptimeDB features a distributed query engine with intelligent indexing to handle high cardinality data efficiently. Its optimized columnar storage achieves 50x cost efficiency on cloud object storage through advanced compression. [Benchmark reports](https://www.greptime.com/blogs/2024-09-09-report-summary).
Written in pure Rust for superior performance and reliability. GreptimeDB features a distributed query engine with intelligent indexing to handle high cardinality data efficiently. Its optimized columnar storage achieves 50x cost efficiency on cloud object storage through advanced compression. [Benchmark reports](https://www.greptime.com/blogs/2024-09-09-report-summary).
* **Cloud-Edge Collaboration**
GreptimeDB seamlessly operates across cloud and edge (ARM/Android/Linux), providing consistent APIs and control plane for unified data management and efficient synchronization. [Learn how to run on Android](https://docs.greptime.com/user-guide/deployments/run-on-android/).
GreptimeDB seamlessly operates across cloud and edge (ARM/Android/Linux), providing consistent APIs and control plane for unified data management and efficient synchronization. [Learn how to run on Android](https://docs.greptime.com/user-guide/deployments/run-on-android/).
* **Multi-protocol Ingestion, SQL & PromQL Ready**
Widely adopted database protocols and APIs, including MySQL, PostgreSQL, InfluxDB, OpenTelemetry, Loki and Prometheus, etc. Effortless Adoption & Seamless Migration. [Supported Protocols Overview](https://docs.greptime.com/user-guide/protocols/overview).
Widely adopted database protocols and APIs, including MySQL, PostgreSQL, InfluxDB, OpenTelemetry, Loki and Prometheus, etc. Effortless Adoption & Seamless Migration. [Supported Protocols Overview](https://docs.greptime.com/user-guide/protocols/overview).
For more detailed info please read [Why GreptimeDB](https://docs.greptime.com/user-guide/concepts/why-greptimedb).
@@ -138,7 +138,7 @@ Check the prerequisite:
* [Rust toolchain](https://www.rust-lang.org/tools/install) (nightly)
* [Protobuf compiler](https://grpc.io/docs/protoc-installation/) (>= 3.15)
* Python toolchain (optional): Required only if built with PyO3 backend. More detail for compiling with PyO3 can be found in its [documentation](https://pyo3.rs/v0.18.1/building_and_distribution#configuring-the-python-version).
* Python toolchain (optional): Required only if built with PyO3 backend. More details for compiling with PyO3 can be found in its [documentation](https://pyo3.rs/v0.18.1/building_and_distribution#configuring-the-python-version).
Build GreptimeDB binary:
@@ -154,6 +154,10 @@ cargo run -- standalone start
## Tools & Extensions
### Kubernetes
- [GreptimeDB Operator](https://github.com/GrepTimeTeam/greptimedb-operator)
### Dashboard
- [The dashboard UI for GreptimeDB](https://github.com/GreptimeTeam/dashboard)
@@ -173,7 +177,7 @@ Our official Grafana dashboard for monitoring GreptimeDB is available at [grafan
## Project Status
GreptimeDB is currently in Beta. We are targeting GA (General Availability) with v1.0 release by Early 2025.
GreptimeDB is currently in Beta. We are targeting GA (General Availability) with v1.0 release by Early 2025.
While in Beta, GreptimeDB is already:

View File

@@ -15,8 +15,8 @@ RUN apt-get update && \
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y \
libssl-dev \
tzdata \
protobuf-compiler \
curl \
unzip \
ca-certificates \
git \
build-essential \
@@ -24,6 +24,20 @@ RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y \
python3.10 \
python3.10-dev
ARG TARGETPLATFORM
RUN echo "target platform: $TARGETPLATFORM"
# Install protobuf, because the one in apt is too old (v3.12).
RUN if [ "$TARGETPLATFORM" = "linux/arm64" ]; then \
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v29.1/protoc-29.1-linux-aarch_64.zip && \
unzip protoc-29.1-linux-aarch_64.zip -d protoc3; \
elif [ "$TARGETPLATFORM" = "linux/amd64" ]; then \
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v29.1/protoc-29.1-linux-x86_64.zip && \
unzip protoc-29.1-linux-x86_64.zip -d protoc3; \
fi
RUN mv protoc3/bin/* /usr/local/bin/
RUN mv protoc3/include/* /usr/local/include/
# https://github.com/GreptimeTeam/greptimedb/actions/runs/10935485852/job/30357457188#step:3:7106
# `aws-lc-sys` require gcc >= 10.3.0 to work, hence alias to use gcc-10
RUN apt-get remove -y gcc-9 g++-9 cpp-9 && \
@@ -49,7 +63,7 @@ RUN apt-get -y purge python3.8 && \
# wildcard here. However, that requires the git's config files and the submodules all owned by the very same user.
# It's troublesome to do this since the dev build runs in Docker, which is under user "root"; while outside the Docker,
# it can be a different user that have prepared the submodules.
RUN git config --global --add safe.directory *
RUN git config --global --add safe.directory '*'
# Install Python dependencies.
COPY $DOCKER_BUILD_ROOT/docker/python/requirements.txt /etc/greptime/requirements.txt

View File

@@ -25,6 +25,7 @@ pub enum PermissionReq<'a> {
GrpcRequest(&'a Request),
SqlStatement(&'a Statement),
PromQuery,
LogQuery,
Opentsdb,
LineProtocol,
PromStoreWrite,

View File

@@ -64,6 +64,13 @@ pub enum Error {
source: BoxedError,
},
#[snafu(display("Failed to list flow stats"))]
ListFlowStats {
#[snafu(implicit)]
location: Location,
source: BoxedError,
},
#[snafu(display("Failed to list flows in catalog {catalog}"))]
ListFlows {
#[snafu(implicit)]
@@ -326,6 +333,7 @@ impl ErrorExt for Error {
| Error::ListSchemas { source, .. }
| Error::ListTables { source, .. }
| Error::ListFlows { source, .. }
| Error::ListFlowStats { source, .. }
| Error::ListProcedures { source, .. }
| Error::ListRegionStats { source, .. }
| Error::ConvertProtoData { source, .. } => source.status_code(),

View File

@@ -17,6 +17,7 @@ use common_error::ext::BoxedError;
use common_meta::cluster::{ClusterInfo, NodeInfo};
use common_meta::datanode::RegionStat;
use common_meta::ddl::{ExecutorContext, ProcedureExecutor};
use common_meta::key::flow::flow_state::FlowStat;
use common_meta::rpc::procedure;
use common_procedure::{ProcedureInfo, ProcedureState};
use meta_client::MetaClientRef;
@@ -89,4 +90,12 @@ impl InformationExtension for DistributedInformationExtension {
.map_err(BoxedError::new)
.context(error::ListRegionStatsSnafu)
}
async fn flow_stats(&self) -> std::result::Result<Option<FlowStat>, Self::Error> {
self.meta_client
.list_flow_stats()
.await
.map_err(BoxedError::new)
.context(crate::error::ListFlowStatsSnafu)
}
}

View File

@@ -38,7 +38,7 @@ pub fn new_table_cache(
) -> TableCache {
let init = init_factory(table_info_cache, table_name_cache);
CacheContainer::new(name, cache, Box::new(invalidator), init, Box::new(filter))
CacheContainer::new(name, cache, Box::new(invalidator), init, filter)
}
fn init_factory(

View File

@@ -35,6 +35,7 @@ use common_catalog::consts::{self, DEFAULT_CATALOG_NAME, INFORMATION_SCHEMA_NAME
use common_error::ext::ErrorExt;
use common_meta::cluster::NodeInfo;
use common_meta::datanode::RegionStat;
use common_meta::key::flow::flow_state::FlowStat;
use common_meta::key::flow::FlowMetadataManager;
use common_procedure::ProcedureInfo;
use common_recordbatch::SendableRecordBatchStream;
@@ -192,6 +193,7 @@ impl SystemSchemaProviderInner for InformationSchemaProvider {
)) as _),
FLOWS => Some(Arc::new(InformationSchemaFlows::new(
self.catalog_name.clone(),
self.catalog_manager.clone(),
self.flow_metadata_manager.clone(),
)) as _),
PROCEDURE_INFO => Some(
@@ -338,6 +340,9 @@ pub trait InformationExtension {
/// Gets the region statistics.
async fn region_stats(&self) -> std::result::Result<Vec<RegionStat>, Self::Error>;
/// Gets the flow statistics. Returns `None` if no flownode is available.
async fn flow_stats(&self) -> std::result::Result<Option<FlowStat>, Self::Error>;
}
pub struct NoopInformationExtension;
@@ -357,4 +362,8 @@ impl InformationExtension for NoopInformationExtension {
async fn region_stats(&self) -> std::result::Result<Vec<RegionStat>, Self::Error> {
Ok(vec![])
}
async fn flow_stats(&self) -> std::result::Result<Option<FlowStat>, Self::Error> {
Ok(None)
}
}

View File

@@ -12,11 +12,12 @@
// See the License for the specific language governing permissions and
// limitations under the License.
use std::sync::Arc;
use std::sync::{Arc, Weak};
use common_catalog::consts::INFORMATION_SCHEMA_FLOW_TABLE_ID;
use common_error::ext::BoxedError;
use common_meta::key::flow::flow_info::FlowInfoValue;
use common_meta::key::flow::flow_state::FlowStat;
use common_meta::key::flow::FlowMetadataManager;
use common_meta::key::FlowId;
use common_recordbatch::adapter::RecordBatchStreamAdapter;
@@ -28,7 +29,9 @@ use datatypes::prelude::ConcreteDataType as CDT;
use datatypes::scalars::ScalarVectorBuilder;
use datatypes::schema::{ColumnSchema, Schema, SchemaRef};
use datatypes::value::Value;
use datatypes::vectors::{Int64VectorBuilder, StringVectorBuilder, UInt32VectorBuilder, VectorRef};
use datatypes::vectors::{
Int64VectorBuilder, StringVectorBuilder, UInt32VectorBuilder, UInt64VectorBuilder, VectorRef,
};
use futures::TryStreamExt;
use snafu::{OptionExt, ResultExt};
use store_api::storage::{ScanRequest, TableId};
@@ -38,6 +41,8 @@ use crate::error::{
};
use crate::information_schema::{Predicates, FLOWS};
use crate::system_schema::information_schema::InformationTable;
use crate::system_schema::utils;
use crate::CatalogManager;
const INIT_CAPACITY: usize = 42;
@@ -45,6 +50,7 @@ const INIT_CAPACITY: usize = 42;
// pk is (flow_name, flow_id, table_catalog)
pub const FLOW_NAME: &str = "flow_name";
pub const FLOW_ID: &str = "flow_id";
pub const STATE_SIZE: &str = "state_size";
pub const TABLE_CATALOG: &str = "table_catalog";
pub const FLOW_DEFINITION: &str = "flow_definition";
pub const COMMENT: &str = "comment";
@@ -55,20 +61,24 @@ pub const FLOWNODE_IDS: &str = "flownode_ids";
pub const OPTIONS: &str = "options";
/// The `information_schema.flows` table provides information about flows in databases.
///
pub(super) struct InformationSchemaFlows {
schema: SchemaRef,
catalog_name: String,
catalog_manager: Weak<dyn CatalogManager>,
flow_metadata_manager: Arc<FlowMetadataManager>,
}
impl InformationSchemaFlows {
pub(super) fn new(
catalog_name: String,
catalog_manager: Weak<dyn CatalogManager>,
flow_metadata_manager: Arc<FlowMetadataManager>,
) -> Self {
Self {
schema: Self::schema(),
catalog_name,
catalog_manager,
flow_metadata_manager,
}
}
@@ -80,6 +90,7 @@ impl InformationSchemaFlows {
vec![
(FLOW_NAME, CDT::string_datatype(), false),
(FLOW_ID, CDT::uint32_datatype(), false),
(STATE_SIZE, CDT::uint64_datatype(), true),
(TABLE_CATALOG, CDT::string_datatype(), false),
(FLOW_DEFINITION, CDT::string_datatype(), false),
(COMMENT, CDT::string_datatype(), true),
@@ -99,6 +110,7 @@ impl InformationSchemaFlows {
InformationSchemaFlowsBuilder::new(
self.schema.clone(),
self.catalog_name.clone(),
self.catalog_manager.clone(),
&self.flow_metadata_manager,
)
}
@@ -144,10 +156,12 @@ impl InformationTable for InformationSchemaFlows {
struct InformationSchemaFlowsBuilder {
schema: SchemaRef,
catalog_name: String,
catalog_manager: Weak<dyn CatalogManager>,
flow_metadata_manager: Arc<FlowMetadataManager>,
flow_names: StringVectorBuilder,
flow_ids: UInt32VectorBuilder,
state_sizes: UInt64VectorBuilder,
table_catalogs: StringVectorBuilder,
raw_sqls: StringVectorBuilder,
comments: StringVectorBuilder,
@@ -162,15 +176,18 @@ impl InformationSchemaFlowsBuilder {
fn new(
schema: SchemaRef,
catalog_name: String,
catalog_manager: Weak<dyn CatalogManager>,
flow_metadata_manager: &Arc<FlowMetadataManager>,
) -> Self {
Self {
schema,
catalog_name,
catalog_manager,
flow_metadata_manager: flow_metadata_manager.clone(),
flow_names: StringVectorBuilder::with_capacity(INIT_CAPACITY),
flow_ids: UInt32VectorBuilder::with_capacity(INIT_CAPACITY),
state_sizes: UInt64VectorBuilder::with_capacity(INIT_CAPACITY),
table_catalogs: StringVectorBuilder::with_capacity(INIT_CAPACITY),
raw_sqls: StringVectorBuilder::with_capacity(INIT_CAPACITY),
comments: StringVectorBuilder::with_capacity(INIT_CAPACITY),
@@ -195,6 +212,11 @@ impl InformationSchemaFlowsBuilder {
.flow_names(&catalog_name)
.await;
let flow_stat = {
let information_extension = utils::information_extension(&self.catalog_manager)?;
information_extension.flow_stats().await?
};
while let Some((flow_name, flow_id)) = stream
.try_next()
.await
@@ -213,7 +235,7 @@ impl InformationSchemaFlowsBuilder {
catalog_name: catalog_name.to_string(),
flow_name: flow_name.to_string(),
})?;
self.add_flow(&predicates, flow_id.flow_id(), flow_info)?;
self.add_flow(&predicates, flow_id.flow_id(), flow_info, &flow_stat)?;
}
self.finish()
@@ -224,6 +246,7 @@ impl InformationSchemaFlowsBuilder {
predicates: &Predicates,
flow_id: FlowId,
flow_info: FlowInfoValue,
flow_stat: &Option<FlowStat>,
) -> Result<()> {
let row = [
(FLOW_NAME, &Value::from(flow_info.flow_name().to_string())),
@@ -238,6 +261,11 @@ impl InformationSchemaFlowsBuilder {
}
self.flow_names.push(Some(flow_info.flow_name()));
self.flow_ids.push(Some(flow_id));
self.state_sizes.push(
flow_stat
.as_ref()
.and_then(|state| state.state_size.get(&flow_id).map(|v| *v as u64)),
);
self.table_catalogs.push(Some(flow_info.catalog_name()));
self.raw_sqls.push(Some(flow_info.raw_sql()));
self.comments.push(Some(flow_info.comment()));
@@ -270,6 +298,7 @@ impl InformationSchemaFlowsBuilder {
let columns: Vec<VectorRef> = vec![
Arc::new(self.flow_names.finish()),
Arc::new(self.flow_ids.finish()),
Arc::new(self.state_sizes.finish()),
Arc::new(self.table_catalogs.finish()),
Arc::new(self.raw_sqls.finish()),
Arc::new(self.comments.finish()),

View File

@@ -34,7 +34,7 @@ use common_query::Output;
use common_recordbatch::RecordBatches;
use common_telemetry::debug;
use either::Either;
use meta_client::client::MetaClientBuilder;
use meta_client::client::{ClusterKvBackend, MetaClientBuilder};
use query::datafusion::DatafusionQueryEngine;
use query::parser::QueryLanguageParser;
use query::query_engine::{DefaultSerializer, QueryEngineState};

View File

@@ -34,6 +34,7 @@ use common_meta::ddl::flow_meta::{FlowMetadataAllocator, FlowMetadataAllocatorRe
use common_meta::ddl::table_meta::{TableMetadataAllocator, TableMetadataAllocatorRef};
use common_meta::ddl::{DdlContext, NoopRegionFailureDetectorControl, ProcedureExecutorRef};
use common_meta::ddl_manager::DdlManager;
use common_meta::key::flow::flow_state::FlowStat;
use common_meta::key::flow::{FlowMetadataManager, FlowMetadataManagerRef};
use common_meta::key::{TableMetadataManager, TableMetadataManagerRef};
use common_meta::kv_backend::KvBackendRef;
@@ -70,7 +71,7 @@ use servers::http::HttpOptions;
use servers::tls::{TlsMode, TlsOption};
use servers::Mode;
use snafu::ResultExt;
use tokio::sync::broadcast;
use tokio::sync::{broadcast, RwLock};
use tracing_appender::non_blocking::WorkerGuard;
use crate::error::{
@@ -507,7 +508,7 @@ impl StartCommand {
procedure_manager.clone(),
));
let catalog_manager = KvBackendCatalogManager::new(
information_extension,
information_extension.clone(),
kv_backend.clone(),
layered_cache_registry.clone(),
Some(procedure_manager.clone()),
@@ -532,6 +533,14 @@ impl StartCommand {
.context(OtherSnafu)?,
);
// set the flow worker manager reference so the local flow state can be queried
{
let flow_worker_manager = flownode.flow_worker_manager();
information_extension
.set_flow_worker_manager(flow_worker_manager.clone())
.await;
}
let node_manager = Arc::new(StandaloneDatanodeManager {
region_server: datanode.region_server(),
flow_server: flownode.flow_worker_manager(),
@@ -669,6 +678,7 @@ pub struct StandaloneInformationExtension {
region_server: RegionServer,
procedure_manager: ProcedureManagerRef,
start_time_ms: u64,
flow_worker_manager: RwLock<Option<Arc<FlowWorkerManager>>>,
}
impl StandaloneInformationExtension {
@@ -677,8 +687,15 @@ impl StandaloneInformationExtension {
region_server,
procedure_manager,
start_time_ms: common_time::util::current_time_millis() as u64,
flow_worker_manager: RwLock::new(None),
}
}
/// Set the flow worker manager for the standalone instance.
pub async fn set_flow_worker_manager(&self, flow_worker_manager: Arc<FlowWorkerManager>) {
let mut guard = self.flow_worker_manager.write().await;
*guard = Some(flow_worker_manager);
}
}
#[async_trait::async_trait]
@@ -750,6 +767,18 @@ impl InformationExtension for StandaloneInformationExtension {
.collect::<Vec<_>>();
Ok(stats)
}
async fn flow_stats(&self) -> std::result::Result<Option<FlowStat>, Self::Error> {
Ok(Some(
self.flow_worker_manager
.read()
.await
.as_ref()
.unwrap()
.gen_state_report()
.await,
))
}
}
#[cfg(test)]
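
The `StandaloneInformationExtension` above holds the flow worker manager behind `RwLock<Option<Arc<..>>>` because it is injected after construction. Here is a stripped-down sketch of that late-binding pattern, assuming a tokio runtime; the types are simplified stand-ins, not the actual structs.

```rust
use std::sync::Arc;
use tokio::sync::RwLock;

struct Extension {
    // Stands in for `RwLock<Option<Arc<FlowWorkerManager>>>` in the real code.
    worker: RwLock<Option<Arc<String>>>,
}

impl Extension {
    async fn set_worker(&self, worker: Arc<String>) {
        // Injected once after the rest of the system is wired up.
        *self.worker.write().await = Some(worker);
    }

    async fn report(&self) -> Option<String> {
        // The real `flow_stats` unwraps; a sketch can show the non-panicking variant.
        self.worker
            .read()
            .await
            .as_ref()
            .map(|w| format!("state report from {w}"))
    }
}

#[tokio::main]
async fn main() {
    let ext = Extension { worker: RwLock::new(None) };
    assert!(ext.report().await.is_none());
    ext.set_worker(Arc::new("flow-worker".to_string())).await;
    assert!(ext.report().await.is_some());
}
```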

View File

@@ -26,3 +26,4 @@ pub mod function_registry;
pub mod handlers;
pub mod helper;
pub mod state;
pub mod utils;

View File

@@ -204,20 +204,10 @@ impl PatternAst {
fn convert_literal(column: &str, pattern: &str) -> Expr {
logical_expr::col(column).like(logical_expr::lit(format!(
"%{}%",
Self::escape_pattern(pattern)
crate::utils::escape_like_pattern(pattern)
)))
}
fn escape_pattern(pattern: &str) -> String {
pattern
.chars()
.flat_map(|c| match c {
'\\' | '%' | '_' => vec!['\\', c],
_ => vec![c],
})
.collect::<String>()
}
/// Transform this AST with preset rules to make it correct.
fn transform_ast(self) -> Result<Self> {
self.transform_up(Self::collapse_binary_branch_fn)

View File

@@ -0,0 +1,58 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
/// Escapes special characters in the provided pattern string for `LIKE`.
///
/// Specifically, it prefixes the backslash (`\`), percent (`%`), and underscore (`_`)
/// characters with an additional backslash to ensure they are treated literally.
///
/// # Examples
///
/// ```rust
/// let escaped = escape_like_pattern("100%_some\\path");
/// assert_eq!(escaped, "100\\%\\_some\\\\path");
/// ```
pub fn escape_like_pattern(pattern: &str) -> String {
pattern
.chars()
.flat_map(|c| match c {
'\\' | '%' | '_' => vec!['\\', c],
_ => vec![c],
})
.collect::<String>()
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_escape_like_pattern() {
assert_eq!(
escape_like_pattern("100%_some\\path"),
"100\\%\\_some\\\\path"
);
assert_eq!(escape_like_pattern(""), "");
assert_eq!(escape_like_pattern("hello"), "hello");
assert_eq!(escape_like_pattern("\\%_"), "\\\\\\%\\_");
assert_eq!(escape_like_pattern("%%__\\\\"), "\\%\\%\\_\\_\\\\\\\\");
assert_eq!(escape_like_pattern("abc123"), "abc123");
assert_eq!(escape_like_pattern("%_\\"), "\\%\\_\\\\");
assert_eq!(
escape_like_pattern("%%__\\\\another%string"),
"\\%\\%\\_\\_\\\\\\\\another\\%string"
);
assert_eq!(escape_like_pattern("foo%bar_"), "foo\\%bar\\_");
assert_eq!(escape_like_pattern("\\_\\%"), "\\\\\\_\\\\\\%");
}
}

View File

@@ -43,7 +43,7 @@ pub struct CacheContainer<K, V, CacheToken> {
cache: Cache<K, V>,
invalidator: Invalidator<K, V, CacheToken>,
initializer: Initializer<K, V>,
token_filter: TokenFilter<CacheToken>,
token_filter: fn(&CacheToken) -> bool,
}
impl<K, V, CacheToken> CacheContainer<K, V, CacheToken>
@@ -58,7 +58,7 @@ where
cache: Cache<K, V>,
invalidator: Invalidator<K, V, CacheToken>,
initializer: Initializer<K, V>,
token_filter: TokenFilter<CacheToken>,
token_filter: fn(&CacheToken) -> bool,
) -> Self {
Self {
name,
@@ -206,10 +206,13 @@ mod tests {
name: &'a str,
}
fn always_true_filter(_: &String) -> bool {
true
}
#[tokio::test]
async fn test_get() {
let cache: Cache<NameKey, String> = CacheBuilder::new(128).build();
let filter: TokenFilter<String> = Box::new(|_| true);
let counter = Arc::new(AtomicI32::new(0));
let moved_counter = counter.clone();
let init: Initializer<NameKey, String> = Arc::new(move |_| {
@@ -219,7 +222,13 @@ mod tests {
let invalidator: Invalidator<NameKey, String, String> =
Box::new(|_, _| Box::pin(async { Ok(()) }));
let adv_cache = CacheContainer::new("test".to_string(), cache, invalidator, init, filter);
let adv_cache = CacheContainer::new(
"test".to_string(),
cache,
invalidator,
init,
always_true_filter,
);
let key = NameKey { name: "key" };
let value = adv_cache.get(key).await.unwrap().unwrap();
assert_eq!(value, "hi");
@@ -233,7 +242,6 @@ mod tests {
#[tokio::test]
async fn test_get_by_ref() {
let cache: Cache<String, String> = CacheBuilder::new(128).build();
let filter: TokenFilter<String> = Box::new(|_| true);
let counter = Arc::new(AtomicI32::new(0));
let moved_counter = counter.clone();
let init: Initializer<String, String> = Arc::new(move |_| {
@@ -243,7 +251,13 @@ mod tests {
let invalidator: Invalidator<String, String, String> =
Box::new(|_, _| Box::pin(async { Ok(()) }));
let adv_cache = CacheContainer::new("test".to_string(), cache, invalidator, init, filter);
let adv_cache = CacheContainer::new(
"test".to_string(),
cache,
invalidator,
init,
always_true_filter,
);
let value = adv_cache.get_by_ref("foo").await.unwrap().unwrap();
assert_eq!(value, "hi");
let value = adv_cache.get_by_ref("foo").await.unwrap().unwrap();
@@ -257,13 +271,18 @@ mod tests {
#[tokio::test]
async fn test_get_value_not_exits() {
let cache: Cache<String, String> = CacheBuilder::new(128).build();
let filter: TokenFilter<String> = Box::new(|_| true);
let init: Initializer<String, String> =
Arc::new(move |_| Box::pin(async { error::ValueNotExistSnafu {}.fail() }));
let invalidator: Invalidator<String, String, String> =
Box::new(|_, _| Box::pin(async { Ok(()) }));
let adv_cache = CacheContainer::new("test".to_string(), cache, invalidator, init, filter);
let adv_cache = CacheContainer::new(
"test".to_string(),
cache,
invalidator,
init,
always_true_filter,
);
let value = adv_cache.get_by_ref("foo").await.unwrap();
assert!(value.is_none());
}
@@ -271,7 +290,6 @@ mod tests {
#[tokio::test]
async fn test_invalidate() {
let cache: Cache<String, String> = CacheBuilder::new(128).build();
let filter: TokenFilter<String> = Box::new(|_| true);
let counter = Arc::new(AtomicI32::new(0));
let moved_counter = counter.clone();
let init: Initializer<String, String> = Arc::new(move |_| {
@@ -285,7 +303,13 @@ mod tests {
})
});
let adv_cache = CacheContainer::new("test".to_string(), cache, invalidator, init, filter);
let adv_cache = CacheContainer::new(
"test".to_string(),
cache,
invalidator,
init,
always_true_filter,
);
let value = adv_cache.get_by_ref("foo").await.unwrap().unwrap();
assert_eq!(value, "hi");
let value = adv_cache.get_by_ref("foo").await.unwrap().unwrap();

View File

@@ -45,7 +45,7 @@ pub fn new_table_flownode_set_cache(
let table_flow_manager = Arc::new(TableFlowManager::new(kv_backend));
let init = init_factory(table_flow_manager);
CacheContainer::new(name, cache, Box::new(invalidator), init, Box::new(filter))
CacheContainer::new(name, cache, Box::new(invalidator), init, filter)
}
fn init_factory(table_flow_manager: TableFlowManagerRef) -> Initializer<TableId, FlownodeSet> {

View File

@@ -151,12 +151,15 @@ mod tests {
use crate::cache::*;
use crate::instruction::CacheIdent;
fn always_true_filter(_: &CacheIdent) -> bool {
true
}
fn test_cache(
name: &str,
invalidator: Invalidator<String, String, CacheIdent>,
) -> CacheContainer<String, String, CacheIdent> {
let cache: Cache<String, String> = CacheBuilder::new(128).build();
let filter: TokenFilter<CacheIdent> = Box::new(|_| true);
let counter = Arc::new(AtomicI32::new(0));
let moved_counter = counter.clone();
let init: Initializer<String, String> = Arc::new(move |_| {
@@ -164,7 +167,13 @@ mod tests {
Box::pin(async { Ok(Some("hi".to_string())) })
});
CacheContainer::new(name.to_string(), cache, invalidator, init, filter)
CacheContainer::new(
name.to_string(),
cache,
invalidator,
init,
always_true_filter,
)
}
fn test_i32_cache(
@@ -172,7 +181,6 @@ mod tests {
invalidator: Invalidator<i32, String, CacheIdent>,
) -> CacheContainer<i32, String, CacheIdent> {
let cache: Cache<i32, String> = CacheBuilder::new(128).build();
let filter: TokenFilter<CacheIdent> = Box::new(|_| true);
let counter = Arc::new(AtomicI32::new(0));
let moved_counter = counter.clone();
let init: Initializer<i32, String> = Arc::new(move |_| {
@@ -180,7 +188,13 @@ mod tests {
Box::pin(async { Ok(Some("foo".to_string())) })
});
CacheContainer::new(name.to_string(), cache, invalidator, init, filter)
CacheContainer::new(
name.to_string(),
cache,
invalidator,
init,
always_true_filter,
)
}
#[tokio::test]

View File

@@ -36,7 +36,7 @@ pub fn new_schema_cache(
let schema_manager = SchemaManager::new(kv_backend.clone());
let init = init_factory(schema_manager);
CacheContainer::new(name, cache, Box::new(invalidator), init, Box::new(filter))
CacheContainer::new(name, cache, Box::new(invalidator), init, filter)
}
fn init_factory(schema_manager: SchemaManager) -> Initializer<SchemaName, Arc<SchemaNameValue>> {

View File

@@ -41,7 +41,7 @@ pub fn new_table_info_cache(
let table_info_manager = Arc::new(TableInfoManager::new(kv_backend));
let init = init_factory(table_info_manager);
CacheContainer::new(name, cache, Box::new(invalidator), init, Box::new(filter))
CacheContainer::new(name, cache, Box::new(invalidator), init, filter)
}
fn init_factory(table_info_manager: TableInfoManagerRef) -> Initializer<TableId, Arc<TableInfo>> {

View File

@@ -41,7 +41,7 @@ pub fn new_table_name_cache(
let table_name_manager = Arc::new(TableNameManager::new(kv_backend));
let init = init_factory(table_name_manager);
CacheContainer::new(name, cache, Box::new(invalidator), init, Box::new(filter))
CacheContainer::new(name, cache, Box::new(invalidator), init, filter)
}
fn init_factory(table_name_manager: TableNameManagerRef) -> Initializer<TableName, TableId> {

View File

@@ -65,7 +65,7 @@ pub fn new_table_route_cache(
let table_info_manager = Arc::new(TableRouteManager::new(kv_backend));
let init = init_factory(table_info_manager);
CacheContainer::new(name, cache, Box::new(invalidator), init, Box::new(filter))
CacheContainer::new(name, cache, Box::new(invalidator), init, filter)
}
fn init_factory(

View File

@@ -40,7 +40,7 @@ pub fn new_table_schema_cache(
let table_info_manager = TableInfoManager::new(kv_backend);
let init = init_factory(table_info_manager);
CacheContainer::new(name, cache, Box::new(invalidator), init, Box::new(filter))
CacheContainer::new(name, cache, Box::new(invalidator), init, filter)
}
fn init_factory(table_info_manager: TableInfoManager) -> Initializer<TableId, Arc<SchemaName>> {

View File

@@ -40,7 +40,7 @@ pub fn new_view_info_cache(
let view_info_manager = Arc::new(ViewInfoManager::new(kv_backend));
let init = init_factory(view_info_manager);
CacheContainer::new(name, cache, Box::new(invalidator), init, Box::new(filter))
CacheContainer::new(name, cache, Box::new(invalidator), init, filter)
}
fn init_factory(view_info_manager: ViewInfoManagerRef) -> Initializer<TableId, Arc<ViewInfoValue>> {

View File

@@ -137,6 +137,7 @@ use self::schema_name::{SchemaManager, SchemaNameKey, SchemaNameValue};
use self::table_route::{TableRouteManager, TableRouteValue};
use self::tombstone::TombstoneManager;
use crate::error::{self, Result, SerdeJsonSnafu};
use crate::key::flow::flow_state::FlowStateValue;
use crate::key::node_address::NodeAddressValue;
use crate::key::table_route::TableRouteKey;
use crate::key::txn_helper::TxnOpGetResponseSet;
@@ -1262,7 +1263,8 @@ impl_metadata_value! {
FlowRouteValue,
TableFlowValue,
NodeAddressValue,
SchemaNameValue
SchemaNameValue,
FlowStateValue
}
impl_optional_metadata_value! {

View File

@@ -13,7 +13,6 @@
// limitations under the License.
use std::fmt::Display;
use std::sync::Arc;
use common_catalog::consts::DEFAULT_CATALOG_NAME;
use futures::stream::BoxStream;
@@ -146,7 +145,7 @@ impl CatalogManager {
self.kv_backend.clone(),
req,
DEFAULT_PAGE_SIZE,
Arc::new(catalog_decoder),
catalog_decoder,
)
.into_stream();
@@ -156,6 +155,8 @@ impl CatalogManager {
#[cfg(test)]
mod tests {
use std::sync::Arc;
use super::*;
use crate::kv_backend::memory::MemoryKvBackend;

View File

@@ -14,7 +14,6 @@
use std::collections::HashMap;
use std::fmt::Display;
use std::sync::Arc;
use futures::stream::BoxStream;
use serde::{Deserialize, Serialize};
@@ -166,7 +165,7 @@ impl DatanodeTableManager {
self.kv_backend.clone(),
req,
DEFAULT_PAGE_SIZE,
Arc::new(datanode_table_value_decoder),
datanode_table_value_decoder,
)
.into_stream();

View File

@@ -15,6 +15,7 @@
pub mod flow_info;
pub(crate) mod flow_name;
pub(crate) mod flow_route;
pub mod flow_state;
pub(crate) mod flownode_flow;
pub(crate) mod table_flow;
@@ -35,6 +36,7 @@ use crate::ensure_values;
use crate::error::{self, Result};
use crate::key::flow::flow_info::FlowInfoManager;
use crate::key::flow::flow_name::FlowNameManager;
use crate::key::flow::flow_state::FlowStateManager;
use crate::key::flow::flownode_flow::FlownodeFlowManager;
pub use crate::key::flow::table_flow::{TableFlowManager, TableFlowManagerRef};
use crate::key::txn_helper::TxnOpGetResponseSet;
@@ -102,6 +104,8 @@ pub struct FlowMetadataManager {
flownode_flow_manager: FlownodeFlowManager,
table_flow_manager: TableFlowManager,
flow_name_manager: FlowNameManager,
/// Only metasrv has access to its own memory backend, so in other cases this should be `None`.
flow_state_manager: Option<FlowStateManager>,
kv_backend: KvBackendRef,
}
@@ -114,6 +118,7 @@ impl FlowMetadataManager {
flow_name_manager: FlowNameManager::new(kv_backend.clone()),
flownode_flow_manager: FlownodeFlowManager::new(kv_backend.clone()),
table_flow_manager: TableFlowManager::new(kv_backend.clone()),
flow_state_manager: None,
kv_backend,
}
}
@@ -123,6 +128,10 @@ impl FlowMetadataManager {
&self.flow_name_manager
}
pub fn flow_state_manager(&self) -> Option<&FlowStateManager> {
self.flow_state_manager.as_ref()
}
/// Returns the [`FlowInfoManager`].
pub fn flow_info_manager(&self) -> &FlowInfoManager {
&self.flow_info_manager

View File

@@ -12,8 +12,6 @@
// See the License for the specific language governing permissions and
// limitations under the License.
use std::sync::Arc;
use futures::stream::BoxStream;
use lazy_static::lazy_static;
use regex::Regex;
@@ -201,7 +199,7 @@ impl FlowNameManager {
self.kv_backend.clone(),
req,
DEFAULT_PAGE_SIZE,
Arc::new(flow_name_decoder),
flow_name_decoder,
)
.into_stream();

View File

@@ -12,8 +12,6 @@
// See the License for the specific language governing permissions and
// limitations under the License.
use std::sync::Arc;
use futures::stream::BoxStream;
use lazy_static::lazy_static;
use regex::Regex;
@@ -179,7 +177,7 @@ impl FlowRouteManager {
self.kv_backend.clone(),
req,
DEFAULT_PAGE_SIZE,
Arc::new(flow_route_decoder),
flow_route_decoder,
)
.into_stream();

View File

@@ -0,0 +1,162 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
use std::collections::BTreeMap;
use std::sync::Arc;
use serde::{Deserialize, Serialize};
use crate::error::{self, Result};
use crate::key::flow::FlowScoped;
use crate::key::{FlowId, MetadataKey, MetadataValue};
use crate::kv_backend::KvBackendRef;
use crate::rpc::store::PutRequest;
/// The entire FlowId-to-flow-size map is stored directly in the value part of this key.
const FLOW_STATE_KEY: &str = "state";
/// The key of flow state.
#[derive(Debug, Clone, Copy, PartialEq)]
struct FlowStateKeyInner;
impl FlowStateKeyInner {
pub fn new() -> Self {
Self
}
}
impl<'a> MetadataKey<'a, FlowStateKeyInner> for FlowStateKeyInner {
fn to_bytes(&self) -> Vec<u8> {
FLOW_STATE_KEY.as_bytes().to_vec()
}
fn from_bytes(bytes: &'a [u8]) -> Result<FlowStateKeyInner> {
let key = std::str::from_utf8(bytes).map_err(|e| {
error::InvalidMetadataSnafu {
err_msg: format!(
"FlowInfoKeyInner '{}' is not a valid UTF8 string: {e}",
String::from_utf8_lossy(bytes)
),
}
.build()
})?;
if key != FLOW_STATE_KEY {
return Err(error::InvalidMetadataSnafu {
err_msg: format!("Invalid FlowStateKeyInner '{key}'"),
}
.build());
}
Ok(FlowStateKeyInner::new())
}
}
/// The key stores the state size of the flow.
///
/// The layout: `__flow/state`.
pub struct FlowStateKey(FlowScoped<FlowStateKeyInner>);
impl FlowStateKey {
/// Returns the [FlowStateKey].
pub fn new() -> FlowStateKey {
let inner = FlowStateKeyInner::new();
FlowStateKey(FlowScoped::new(inner))
}
}
impl Default for FlowStateKey {
fn default() -> Self {
Self::new()
}
}
impl<'a> MetadataKey<'a, FlowStateKey> for FlowStateKey {
fn to_bytes(&self) -> Vec<u8> {
self.0.to_bytes()
}
fn from_bytes(bytes: &'a [u8]) -> Result<FlowStateKey> {
Ok(FlowStateKey(FlowScoped::<FlowStateKeyInner>::from_bytes(
bytes,
)?))
}
}
/// The value of flow state size
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq)]
pub struct FlowStateValue {
/// For each flow id, the size of its in-memory state in bytes
pub state_size: BTreeMap<FlowId, usize>,
}
impl FlowStateValue {
pub fn new(state_size: BTreeMap<FlowId, usize>) -> Self {
Self { state_size }
}
}
pub type FlowStateManagerRef = Arc<FlowStateManager>;
/// The manager of [FlowStateKey]. Since state size changes frequently, we store it in memory.
///
/// This is only used in distributed mode, where meta-srv uses heartbeat messages to update the flow stat report
/// and the frontend uses `get` to fetch the latest report.
pub struct FlowStateManager {
in_memory: KvBackendRef,
}
impl FlowStateManager {
pub fn new(in_memory: KvBackendRef) -> Self {
Self { in_memory }
}
pub async fn get(&self) -> Result<Option<FlowStateValue>> {
let key = FlowStateKey::new().to_bytes();
self.in_memory
.get(&key)
.await?
.map(|x| FlowStateValue::try_from_raw_value(&x.value))
.transpose()
}
pub async fn put(&self, value: FlowStateValue) -> Result<()> {
let key = FlowStateKey::new().to_bytes();
let value = value.try_as_raw_value()?;
let req = PutRequest::new().with_key(key).with_value(value);
self.in_memory.put(req).await?;
Ok(())
}
}
/// Flow state report, sent regularly through heartbeat messages
#[derive(Debug, Clone)]
pub struct FlowStat {
/// For each flow id, the size of its in-memory state in bytes
pub state_size: BTreeMap<u32, usize>,
}
impl From<FlowStateValue> for FlowStat {
fn from(value: FlowStateValue) -> Self {
Self {
state_size: value.state_size,
}
}
}
impl From<FlowStat> for FlowStateValue {
fn from(value: FlowStat) -> Self {
Self {
state_size: value.state_size,
}
}
}
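
A rough usage sketch for the new flow state key and manager follows. The module paths and the `MemoryKvBackend::new()` constructor are assumptions based on other files in this diff; in the real system the value is refreshed through heartbeat messages rather than written directly.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

// Assumed paths; see the common-meta changes elsewhere in this diff.
use common_meta::key::flow::flow_state::{FlowStat, FlowStateManager, FlowStateValue};
use common_meta::kv_backend::memory::MemoryKvBackend;

#[tokio::main]
async fn main() {
    // Stand-in for metasrv's in-memory backend (constructor and error type assumed).
    let backend = Arc::new(MemoryKvBackend::<common_meta::error::Error>::new());
    let manager = FlowStateManager::new(backend);

    // Report that flow 1 currently holds 1024 bytes of state.
    let mut state_size = BTreeMap::new();
    state_size.insert(1, 1024);
    manager.put(FlowStateValue::new(state_size)).await.unwrap();

    // Read it back and convert it into the heartbeat-friendly report type.
    let stat: Option<FlowStat> = manager.get().await.unwrap().map(FlowStat::from);
    assert_eq!(stat.unwrap().state_size.get(&1), Some(&1024));
}
```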

View File

@@ -12,8 +12,6 @@
// See the License for the specific language governing permissions and
// limitations under the License.
use std::sync::Arc;
use futures::stream::BoxStream;
use futures::TryStreamExt;
use lazy_static::lazy_static;
@@ -179,7 +177,7 @@ impl FlownodeFlowManager {
self.kv_backend.clone(),
req,
DEFAULT_PAGE_SIZE,
Arc::new(flownode_flow_key_decoder),
flownode_flow_key_decoder,
)
.into_stream();

View File

@@ -206,7 +206,7 @@ impl TableFlowManager {
self.kv_backend.clone(),
req,
DEFAULT_PAGE_SIZE,
Arc::new(table_flow_decoder),
table_flow_decoder,
)
.into_stream();

View File

@@ -14,7 +14,6 @@
use std::collections::HashMap;
use std::fmt::Display;
use std::sync::Arc;
use common_catalog::consts::{DEFAULT_CATALOG_NAME, DEFAULT_SCHEMA_NAME};
use common_time::DatabaseTimeToLive;
@@ -283,7 +282,7 @@ impl SchemaManager {
self.kv_backend.clone(),
req,
DEFAULT_PAGE_SIZE,
Arc::new(schema_decoder),
schema_decoder,
)
.into_stream();
@@ -308,6 +307,7 @@ impl<'a> From<&'a SchemaName> for SchemaNameKey<'a> {
#[cfg(test)]
mod tests {
use std::sync::Arc;
use std::time::Duration;
use super::*;

View File

@@ -269,7 +269,7 @@ impl TableNameManager {
self.kv_backend.clone(),
req,
DEFAULT_PAGE_SIZE,
Arc::new(table_decoder),
table_decoder,
)
.into_stream();

View File

@@ -36,7 +36,7 @@ pub mod postgres;
pub mod test;
pub mod txn;
pub type KvBackendRef = Arc<dyn KvBackend<Error = Error> + Send + Sync>;
pub type KvBackendRef<E = Error> = Arc<dyn KvBackend<Error = E> + Send + Sync>;
#[async_trait]
pub trait KvBackend: TxnService
@@ -161,6 +161,9 @@ where
Self::Error: ErrorExt,
{
fn reset(&self);
/// Upcasts to `KvBackendRef`, since trait upcasting (https://github.com/rust-lang/rust/issues/65991) is not yet stable.
fn as_kv_backend_ref(self: Arc<Self>) -> KvBackendRef<Self::Error>;
}
pub type ResettableKvBackendRef = Arc<dyn ResettableKvBackend<Error = Error> + Send + Sync>;
pub type ResettableKvBackendRef<E = Error> = Arc<dyn ResettableKvBackend<Error = E> + Send + Sync>;
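// A small sketch (not part of the original diff) of how the new generic aliases and
// the upcast helper fit together; the `MemoryKvBackend` path and `new()` constructor
// are assumptions based on the memory backend shown further below.
fn kv_backend_upcast_sketch() {
    use self::memory::MemoryKvBackend; // assumed module path for the in-memory backend
    let memory = Arc::new(MemoryKvBackend::<Error>::new());
    // Keep a resettable handle for callers that need `reset()`...
    let resettable: ResettableKvBackendRef = memory.clone();
    // ...and upcast to a plain `KvBackendRef` where only the kv API is needed.
    let kv: KvBackendRef = memory.as_kv_backend_ref();
    let _ = (resettable, kv);
}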

View File

@@ -16,13 +16,13 @@ use std::any::Any;
use std::collections::BTreeMap;
use std::fmt::{Display, Formatter};
use std::marker::PhantomData;
use std::sync::RwLock;
use std::sync::{Arc, RwLock};
use async_trait::async_trait;
use common_error::ext::ErrorExt;
use serde::Serializer;
use super::ResettableKvBackend;
use super::{KvBackendRef, ResettableKvBackend};
use crate::kv_backend::txn::{Txn, TxnOp, TxnOpResponse, TxnRequest, TxnResponse};
use crate::kv_backend::{KvBackend, TxnService};
use crate::metrics::METRIC_META_TXN_REQUEST;
@@ -311,6 +311,10 @@ impl<T: ErrorExt + Send + Sync + 'static> ResettableKvBackend for MemoryKvBacken
fn reset(&self) {
self.clear();
}
fn as_kv_backend_ref(self: Arc<Self>) -> KvBackendRef<T> {
self
}
}
#[cfg(test)]

View File

@@ -12,8 +12,6 @@
// See the License for the specific language governing permissions and
// limitations under the License.
use std::sync::Arc;
use async_stream::try_stream;
use common_telemetry::debug;
use futures::Stream;
@@ -148,7 +146,7 @@ impl PaginationStreamFactory {
}
pub struct PaginationStream<T> {
decoder_fn: Arc<KeyValueDecoderFn<T>>,
decoder_fn: fn(KeyValue) -> Result<T>,
factory: PaginationStreamFactory,
}
@@ -158,7 +156,7 @@ impl<T> PaginationStream<T> {
kv: KvBackendRef,
req: RangeRequest,
page_size: usize,
decoder_fn: Arc<KeyValueDecoderFn<T>>,
decoder_fn: fn(KeyValue) -> Result<T>,
) -> Self {
Self {
decoder_fn,
@@ -191,6 +189,7 @@ mod tests {
use std::assert_matches::assert_matches;
use std::collections::BTreeMap;
use std::sync::Arc;
use futures::TryStreamExt;
@@ -250,7 +249,7 @@ mod tests {
..Default::default()
},
DEFAULT_PAGE_SIZE,
Arc::new(decoder),
decoder,
)
.into_stream();
let kv = stream.try_collect::<Vec<_>>().await.unwrap();
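// A hedged sketch of the simplified decoder API: a plain `fn` pointer instead of an
// `Arc<KeyValueDecoderFn<T>>`. The `kv_backend` handle and the `KeyValue` field names
// below are assumptions about the surrounding test setup.
fn decode_kv_sketch(kv: KeyValue) -> Result<(Vec<u8>, Vec<u8>)> {
    Ok((kv.key, kv.value))
}
let decoded = PaginationStream::new(
    kv_backend.clone(),
    RangeRequest::default(),
    DEFAULT_PAGE_SIZE,
    decode_kv_sketch, // no `Arc::new(...)` wrapping needed anymore
)
.into_stream()
.try_collect::<Vec<_>>()
.await
.unwrap();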
@@ -290,7 +289,7 @@ mod tests {
..Default::default()
},
2,
Arc::new(decoder),
decoder,
);
let kv = stream
.into_stream()

View File

@@ -12,8 +12,6 @@
// See the License for the specific language governing permissions and
// limitations under the License.
use std::sync::Arc;
use async_trait::async_trait;
use common_error::ext::BoxedError;
use common_procedure::error::{DeleteStatesSnafu, ListStateSnafu, PutStateSnafu};
@@ -171,7 +169,7 @@ impl StateStore for KvStateStore {
self.kv_backend.clone(),
req,
self.max_num_per_range_request.unwrap_or_default(),
Arc::new(decode_kv),
decode_kv,
)
.into_stream();

View File

@@ -40,6 +40,8 @@ datatypes.workspace = true
enum-as-inner = "0.6.0"
enum_dispatch = "0.3"
futures = "0.3"
get-size-derive2 = "0.1.2"
get-size2 = "0.1.2"
greptime-proto.workspace = true
# This fork of hydroflow is simply for keeping our dependency in our org, and pin the version
# otherwise it is the same with upstream repo

View File

@@ -60,6 +60,7 @@ use crate::repr::{self, DiffRow, Row, BATCH_SIZE};
mod flownode_impl;
mod parse_expr;
mod stat;
#[cfg(test)]
mod tests;
mod util;
@@ -69,6 +70,7 @@ pub(crate) mod node_context;
mod table_source;
use crate::error::Error;
use crate::utils::StateReportHandler;
use crate::FrontendInvoker;
// `GREPTIME_TIMESTAMP` is not used to distinguish when table is created automatically by flow
@@ -137,6 +139,8 @@ pub struct FlowWorkerManager {
///
/// So that a series of event like `inserts -> flush` can be handled correctly
flush_lock: RwLock<()>,
/// Receives a oneshot sender used to send the state size report back
state_report_handler: RwLock<Option<StateReportHandler>>,
}
/// Building FlownodeManager
@@ -170,9 +174,15 @@ impl FlowWorkerManager {
tick_manager,
node_id,
flush_lock: RwLock::new(()),
state_report_handler: RwLock::new(None),
}
}
pub async fn with_state_report_handler(self, handler: StateReportHandler) -> Self {
*self.state_report_handler.write().await = Some(handler);
self
}
/// Create a flownode manager with one worker
pub fn new_with_worker<'s>(
node_id: Option<u32>,
@@ -500,6 +510,27 @@ impl FlowWorkerManager {
/// Flow Runtime related methods
impl FlowWorkerManager {
/// Starts the state report handler, which receives a sender from the heartbeat task and sends the state size report back.
///
/// If the heartbeat task is shut down, this future exits too.
async fn start_state_report_handler(self: Arc<Self>) -> Option<JoinHandle<()>> {
let state_report_handler = self.state_report_handler.write().await.take();
if let Some(mut handler) = state_report_handler {
let zelf = self.clone();
let handler = common_runtime::spawn_global(async move {
while let Some(ret_handler) = handler.recv().await {
let state_report = zelf.gen_state_report().await;
ret_handler.send(state_report).unwrap_or_else(|err| {
common_telemetry::error!(err; "Send state size report error");
});
}
});
Some(handler)
} else {
None
}
}
/// run in common_runtime background runtime
pub fn run_background(
self: Arc<Self>,
@@ -507,6 +538,7 @@ impl FlowWorkerManager {
) -> JoinHandle<()> {
info!("Starting flownode manager's background task");
common_runtime::spawn_global(async move {
let _state_report_handler = self.clone().start_state_report_handler().await;
self.run(shutdown).await;
})
}
@@ -533,6 +565,8 @@ impl FlowWorkerManager {
let default_interval = Duration::from_secs(1);
let mut avg_spd = 0; // rows/sec
let mut since_last_run = tokio::time::Instant::now();
let run_per_trace = 10;
let mut run_cnt = 0;
loop {
// TODO(discord9): only run when new inputs arrive or scheduled to
let row_cnt = self.run_available(true).await.unwrap_or_else(|err| {
@@ -575,10 +609,19 @@ impl FlowWorkerManager {
} else {
(9 * avg_spd + cur_spd) / 10
};
trace!("avg_spd={} r/s, cur_spd={} r/s", avg_spd, cur_spd);
let new_wait = BATCH_SIZE * 1000 / avg_spd.max(1); //in ms
let new_wait = Duration::from_millis(new_wait as u64).min(default_interval);
trace!("Wait for {} ms, row_cnt={}", new_wait.as_millis(), row_cnt);
// Print a trace every `run_per_trace` runs so that we can see if something is wrong,
// without getting flooded with trace logs.
if run_cnt >= run_per_trace {
trace!("avg_spd={} r/s, cur_spd={} r/s", avg_spd, cur_spd);
trace!("Wait for {} ms, row_cnt={}", new_wait.as_millis(), row_cnt);
run_cnt = 0;
} else {
run_cnt += 1;
}
METRIC_FLOW_RUN_INTERVAL_MS.set(new_wait.as_millis() as i64);
since_last_run = tokio::time::Instant::now();
tokio::time::sleep(new_wait).await;
@@ -638,13 +681,18 @@ impl FlowWorkerManager {
&self,
region_id: RegionId,
rows: Vec<DiffRow>,
batch_datatypes: &[ConcreteDataType],
) -> Result<(), Error> {
let rows_len = rows.len();
let table_id = region_id.table_id();
let _timer = METRIC_FLOW_INSERT_ELAPSED
.with_label_values(&[table_id.to_string().as_str()])
.start_timer();
self.node_context.read().await.send(table_id, rows).await?;
self.node_context
.read()
.await
.send(table_id, rows, batch_datatypes)
.await?;
trace!(
"Handling write request for table_id={} with {} rows",
table_id,

View File

@@ -28,6 +28,7 @@ use itertools::Itertools;
use snafu::{OptionExt, ResultExt};
use store_api::storage::RegionId;
use super::util::from_proto_to_data_type;
use crate::adapter::{CreateFlowArgs, FlowWorkerManager};
use crate::error::InternalSnafu;
use crate::metrics::METRIC_FLOW_TASK_COUNT;
@@ -206,9 +207,17 @@ impl Flownode for FlowWorkerManager {
})
.map(|r| (r, now, 1))
.collect_vec();
self.handle_write_request(region_id.into(), rows)
.await
let batch_datatypes = insert_schema
.iter()
.map(from_proto_to_data_type)
.collect::<std::result::Result<Vec<_>, _>>()
.map_err(to_meta_err)?;
self.handle_write_request(region_id.into(), rows, &batch_datatypes)
.await
.map_err(|err| {
common_telemetry::error!(err;"Failed to handle write request");
to_meta_err(err)
})?;
}
Ok(Default::default())
}

View File

@@ -19,6 +19,7 @@ use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use common_telemetry::trace;
use datatypes::prelude::ConcreteDataType;
use session::context::QueryContext;
use snafu::{OptionExt, ResultExt};
use table::metadata::TableId;
@@ -131,7 +132,11 @@ impl SourceSender {
}
/// Returns the number of rows it actually sends (including what's in the buffer)
pub async fn send_rows(&self, rows: Vec<DiffRow>) -> Result<usize, Error> {
pub async fn send_rows(
&self,
rows: Vec<DiffRow>,
batch_datatypes: &[ConcreteDataType],
) -> Result<usize, Error> {
METRIC_FLOW_INPUT_BUF_SIZE.add(rows.len() as _);
while self.send_buf_row_cnt.load(Ordering::SeqCst) >= BATCH_SIZE * 4 {
tokio::task::yield_now().await;
@@ -139,8 +144,11 @@ impl SourceSender {
// row count metrics is approx so relaxed order is ok
self.send_buf_row_cnt
.fetch_add(rows.len(), Ordering::SeqCst);
let batch = Batch::try_from_rows(rows.into_iter().map(|(row, _, _)| row).collect())
.context(EvalSnafu)?;
let batch = Batch::try_from_rows_with_types(
rows.into_iter().map(|(row, _, _)| row).collect(),
batch_datatypes,
)
.context(EvalSnafu)?;
common_telemetry::trace!("Send one batch to worker with {} rows", batch.row_count());
self.send_buf_tx.send(batch).await.map_err(|e| {
crate::error::InternalSnafu {
@@ -157,14 +165,19 @@ impl FlownodeContext {
/// Returns the number of rows it actually sends (including what's in the buffer)
///
/// TODO(discord9): make this concurrent
pub async fn send(&self, table_id: TableId, rows: Vec<DiffRow>) -> Result<usize, Error> {
pub async fn send(
&self,
table_id: TableId,
rows: Vec<DiffRow>,
batch_datatypes: &[ConcreteDataType],
) -> Result<usize, Error> {
let sender = self
.source_sender
.get(&table_id)
.with_context(|| TableNotFoundSnafu {
name: table_id.to_string(),
})?;
sender.send_rows(rows).await
sender.send_rows(rows, batch_datatypes).await
}
/// flush all sender's buf

View File

@@ -0,0 +1,40 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
use std::collections::BTreeMap;
use common_meta::key::flow::flow_state::FlowStat;
use crate::FlowWorkerManager;
impl FlowWorkerManager {
pub async fn gen_state_report(&self) -> FlowStat {
let mut full_report = BTreeMap::new();
for worker in self.worker_handles.iter() {
let worker = worker.lock().await;
match worker.get_state_size().await {
Ok(state_size) => {
full_report.extend(state_size.into_iter().map(|(k, v)| (k as u32, v)))
}
Err(err) => {
common_telemetry::error!(err; "Get flow stat size error");
}
}
}
FlowStat {
state_size: full_report,
}
}
}

View File

@@ -16,12 +16,27 @@ use api::helper::ColumnDataTypeWrapper;
use api::v1::column_def::options_from_column_schema;
use api::v1::{ColumnDataType, ColumnDataTypeExtension, SemanticType};
use common_error::ext::BoxedError;
use datatypes::prelude::ConcreteDataType;
use datatypes::schema::ColumnSchema;
use itertools::Itertools;
use snafu::ResultExt;
use crate::error::{Error, ExternalSnafu};
pub fn from_proto_to_data_type(
column_schema: &api::v1::ColumnSchema,
) -> Result<ConcreteDataType, Error> {
let wrapper = ColumnDataTypeWrapper::try_new(
column_schema.datatype,
column_schema.datatype_extension.clone(),
)
.map_err(BoxedError::new)
.context(ExternalSnafu)?;
let cdt = ConcreteDataType::from(wrapper);
Ok(cdt)
}
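// A hypothetical sketch (not part of the diff) of calling the new helper; the proto
// `ColumnSchema` field names and the `Default` construction are assumptions based on
// how this function reads the schema.
fn proto_to_data_type_sketch() -> Result<(), Error> {
    let proto_col = api::v1::ColumnSchema {
        column_name: "host".to_string(),
        datatype: ColumnDataType::String as i32,
        ..Default::default()
    };
    // Expected to resolve to `ConcreteDataType::string_datatype()`.
    let _cdt = from_proto_to_data_type(&proto_col)?;
    Ok(())
}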
/// Convert a list of `ColumnSchema`s to its corresponding proto type
pub fn column_schemas_to_proto(
column_schemas: Vec<ColumnSchema>,

View File

@@ -197,6 +197,21 @@ impl WorkerHandle {
.fail()
}
}
pub async fn get_state_size(&self) -> Result<BTreeMap<FlowId, usize>, Error> {
let ret = self
.itc_client
.call_with_resp(Request::QueryStateSize)
.await?;
ret.into_query_state_size().map_err(|ret| {
InternalSnafu {
reason: format!(
"Flow Node/Worker itc failed, expect Response::QueryStateSize, found {ret:?}"
),
}
.build()
})
}
}
impl Drop for WorkerHandle {
@@ -361,6 +376,13 @@ impl<'s> Worker<'s> {
Some(Response::ContainTask { result: ret })
}
Request::Shutdown => return Err(()),
Request::QueryStateSize => {
let mut ret = BTreeMap::new();
for (flow_id, task_state) in self.task_states.iter() {
ret.insert(*flow_id, task_state.state.get_state_size());
}
Some(Response::QueryStateSize { result: ret })
}
};
Ok(ret)
}
@@ -391,6 +413,7 @@ pub enum Request {
flow_id: FlowId,
},
Shutdown,
QueryStateSize,
}
#[derive(Debug, EnumAsInner)]
@@ -406,6 +429,10 @@ enum Response {
result: bool,
},
RunAvail,
QueryStateSize {
/// each flow task's state size
result: BTreeMap<FlowId, usize>,
},
}
fn create_inter_thread_call() -> (InterThreadCallClient, InterThreadCallServer) {
@@ -423,10 +450,12 @@ struct InterThreadCallClient {
}
impl InterThreadCallClient {
/// call without response
fn call_no_resp(&self, req: Request) -> Result<(), Error> {
self.arg_sender.send((req, None)).map_err(from_send_error)
}
/// call with response
async fn call_with_resp(&self, req: Request) -> Result<Response, Error> {
let (tx, rx) = oneshot::channel();
self.arg_sender
@@ -527,6 +556,7 @@ mod test {
);
tx.send(Batch::empty()).unwrap();
handle.run_available(0, true).await.unwrap();
assert_eq!(handle.get_state_size().await.unwrap().len(), 1);
assert_eq!(sink_rx.recv().await.unwrap(), Batch::empty());
drop(handle);
worker_thread_handle.join().unwrap();

View File

@@ -30,7 +30,7 @@ use crate::compute::types::{Collection, CollectionBundle, ErrCollector, Toff};
use crate::error::{Error, InvalidQuerySnafu, NotImplementedSnafu};
use crate::expr::{self, Batch, GlobalId, LocalId};
use crate::plan::{Plan, TypedPlan};
use crate::repr::{self, DiffRow};
use crate::repr::{self, DiffRow, RelationType};
mod map;
mod reduce;
@@ -124,10 +124,10 @@ impl Context<'_, '_> {
/// Like `render_plan` but in Batch Mode
pub fn render_plan_batch(&mut self, plan: TypedPlan) -> Result<CollectionBundle<Batch>, Error> {
match plan.plan {
Plan::Constant { rows } => Ok(self.render_constant_batch(rows)),
Plan::Constant { rows } => Ok(self.render_constant_batch(rows, &plan.schema.typ)),
Plan::Get { id } => self.get_batch_by_id(id),
Plan::Let { id, value, body } => self.eval_batch_let(id, value, body),
Plan::Mfp { input, mfp } => self.render_mfp_batch(input, mfp),
Plan::Mfp { input, mfp } => self.render_mfp_batch(input, mfp, &plan.schema.typ),
Plan::Reduce {
input,
key_val_plan,
@@ -172,7 +172,11 @@ impl Context<'_, '_> {
/// render Constant, take all rows that have a timestamp not greater than the current time
/// This function is primarily used for testing
/// Always assume input is sorted by timestamp
pub fn render_constant_batch(&mut self, rows: Vec<DiffRow>) -> CollectionBundle<Batch> {
pub fn render_constant_batch(
&mut self,
rows: Vec<DiffRow>,
output_type: &RelationType,
) -> CollectionBundle<Batch> {
let (send_port, recv_port) = self.df.make_edge::<_, Toff<Batch>>("constant_batch");
let mut per_time: BTreeMap<repr::Timestamp, Vec<DiffRow>> = Default::default();
for (key, group) in &rows.into_iter().group_by(|(_row, ts, _diff)| *ts) {
@@ -185,6 +189,8 @@ impl Context<'_, '_> {
let scheduler_inner = scheduler.clone();
let err_collector = self.err_collector.clone();
let output_type = output_type.clone();
let subgraph_id =
self.df
.add_subgraph_source("ConstantBatch", send_port, move |_ctx, send_port| {
@@ -199,7 +205,14 @@ impl Context<'_, '_> {
not_great_than_now.into_iter().for_each(|(_ts, rows)| {
err_collector.run(|| {
let rows = rows.into_iter().map(|(row, _ts, _diff)| row).collect();
let batch = Batch::try_from_rows(rows)?;
let batch = Batch::try_from_rows_with_types(
rows,
&output_type
.column_types
.iter()
.map(|ty| ty.scalar_type().clone())
.collect_vec(),
)?;
send_port.give(vec![batch]);
Ok(())
});

View File

@@ -25,7 +25,7 @@ use crate::compute::types::{Arranged, Collection, CollectionBundle, ErrCollector
use crate::error::{Error, PlanSnafu};
use crate::expr::{Batch, EvalError, MapFilterProject, MfpPlan, ScalarExpr};
use crate::plan::TypedPlan;
use crate::repr::{self, DiffRow, KeyValDiffRow, Row};
use crate::repr::{self, DiffRow, KeyValDiffRow, RelationType, Row};
use crate::utils::ArrangeHandler;
impl Context<'_, '_> {
@@ -34,6 +34,7 @@ impl Context<'_, '_> {
&mut self,
input: Box<TypedPlan>,
mfp: MapFilterProject,
_output_type: &RelationType,
) -> Result<CollectionBundle<Batch>, Error> {
let input = self.render_plan_batch(*input)?;

View File

@@ -87,6 +87,8 @@ impl Context<'_, '_> {
})?;
let key_val_plan = key_val_plan.clone();
let output_type = output_type.clone();
let now = self.compute_state.current_time_ref();
let err_collector = self.err_collector.clone();
@@ -118,6 +120,7 @@ impl Context<'_, '_> {
src_data,
&key_val_plan,
&accum_plan,
&output_type,
SubgraphArg {
now,
err_collector: &err_collector,
@@ -354,6 +357,7 @@ fn reduce_batch_subgraph(
src_data: impl IntoIterator<Item = Batch>,
key_val_plan: &KeyValPlan,
accum_plan: &AccumulablePlan,
output_type: &RelationType,
SubgraphArg {
now,
err_collector,
@@ -535,17 +539,13 @@ fn reduce_batch_subgraph(
// this output part is not supposed to be resource intensive
// (because each batch usually doesn't produce that many output rows),
// so we can do some costly operation here
let output_types = all_output_dict.first_entry().map(|entry| {
entry
.key()
.iter()
.chain(entry.get().iter())
.map(|v| v.data_type())
.collect::<Vec<ConcreteDataType>>()
});
let output_types = output_type
.column_types
.iter()
.map(|t| t.scalar_type.clone())
.collect_vec();
if let Some(output_types) = output_types {
err_collector.run(|| {
err_collector.run(|| {
let column_cnt = output_types.len();
let row_cnt = all_output_dict.len();
@@ -585,7 +585,6 @@ fn reduce_batch_subgraph(
Ok(())
});
}
}
/// reduce subgraph, reduce the input data into a single row
@@ -1516,7 +1515,9 @@ mod test {
let mut ctx = harness_test_ctx(&mut df, &mut state);
let rows = vec![
(Row::new(vec![1i64.into()]), 1, 1),
(Row::new(vec![Value::Null]), -1, 1),
(Row::new(vec![1i64.into()]), 0, 1),
(Row::new(vec![Value::Null]), 1, 1),
(Row::new(vec![2i64.into()]), 2, 1),
(Row::new(vec![3i64.into()]), 3, 1),
(Row::new(vec![1i64.into()]), 4, 1),
@@ -1558,13 +1559,15 @@ mod test {
Box::new(input_plan.with_types(typ.into_unnamed())),
&key_val_plan,
&reduce_plan,
&RelationType::empty(),
&RelationType::new(vec![ColumnType::new(CDT::int64_datatype(), true)]),
)
.unwrap();
{
let now_inner = now.clone();
let expected = BTreeMap::<i64, Vec<i64>>::from([
(-1, vec![]),
(0, vec![1i64]),
(1, vec![1i64]),
(2, vec![3i64]),
(3, vec![6i64]),
@@ -1581,7 +1584,11 @@ mod test {
if let Some(expected) = expected.get(&now) {
let batch = expected.iter().map(|v| Value::from(*v)).collect_vec();
let batch = Batch::try_from_rows(vec![batch.into()]).unwrap();
let batch = Batch::try_from_rows_with_types(
vec![batch.into()],
&[CDT::int64_datatype()],
)
.unwrap();
assert_eq!(res.first(), Some(&batch));
}
});

View File

@@ -16,6 +16,7 @@ use std::cell::RefCell;
use std::collections::{BTreeMap, VecDeque};
use std::rc::Rc;
use get_size2::GetSize;
use hydroflow::scheduled::graph::Hydroflow;
use hydroflow::scheduled::SubgraphId;
@@ -109,6 +110,10 @@ impl DataflowState {
pub fn expire_after(&self) -> Option<Timestamp> {
self.expire_after
}
pub fn get_state_size(&self) -> usize {
self.arrange_used.iter().map(|x| x.read().get_size()).sum()
}
}
#[derive(Debug, Clone)]

View File

@@ -24,7 +24,7 @@ mod scalar;
mod signature;
use arrow::compute::FilterBuilder;
use datatypes::prelude::DataType;
use datatypes::prelude::{ConcreteDataType, DataType};
use datatypes::value::Value;
use datatypes::vectors::{BooleanVector, Helper, VectorRef};
pub(crate) use df_func::{DfScalarFunction, RawDfScalarFn};
@@ -85,16 +85,18 @@ impl Default for Batch {
}
impl Batch {
pub fn try_from_rows(rows: Vec<crate::repr::Row>) -> Result<Self, EvalError> {
/// Create a batch from rows, using the provided datatypes for the columns
pub fn try_from_rows_with_types(
rows: Vec<crate::repr::Row>,
batch_datatypes: &[ConcreteDataType],
) -> Result<Self, EvalError> {
if rows.is_empty() {
return Ok(Self::empty());
}
let len = rows.len();
let mut builder = rows
.first()
.unwrap()
let mut builder = batch_datatypes
.iter()
.map(|v| v.data_type().create_mutable_vector(len))
.map(|ty| ty.create_mutable_vector(len))
.collect_vec();
for row in rows {
ensure!(
@@ -221,10 +223,25 @@ impl Batch {
return Ok(());
}
let dts = if self.batch.is_empty() {
other.batch.iter().map(|v| v.data_type()).collect_vec()
} else {
self.batch.iter().map(|v| v.data_type()).collect_vec()
let dts = {
let max_len = self.batch.len().max(other.batch.len());
let mut dts = Vec::with_capacity(max_len);
for i in 0..max_len {
if let Some(v) = self.batch().get(i)
&& !v.data_type().is_null()
{
dts.push(v.data_type())
} else if let Some(v) = other.batch().get(i)
&& !v.data_type().is_null()
{
dts.push(v.data_type())
} else {
// both are null, so we will push null type
dts.push(datatypes::prelude::ConcreteDataType::null_datatype())
}
}
dts
};
let batch_builders = dts

View File

@@ -908,20 +908,33 @@ mod test {
.unwrap()
.unwrap();
assert_eq!(ret, Row::pack(vec![Value::from(false), Value::from(true)]));
let ty = [
ConcreteDataType::int32_datatype(),
ConcreteDataType::int32_datatype(),
ConcreteDataType::int32_datatype(),
];
// batch mode
let mut batch = Batch::try_from_rows(vec![Row::from(vec![
Value::from(4),
Value::from(2),
Value::from(3),
])])
let mut batch = Batch::try_from_rows_with_types(
vec![Row::from(vec![
Value::from(4),
Value::from(2),
Value::from(3),
])],
&ty,
)
.unwrap();
let ret = safe_mfp.eval_batch_into(&mut batch).unwrap();
assert_eq!(
ret,
Batch::try_from_rows(vec![Row::from(vec![Value::from(false), Value::from(true)])])
.unwrap()
Batch::try_from_rows_with_types(
vec![Row::from(vec![Value::from(false), Value::from(true)])],
&[
ConcreteDataType::boolean_datatype(),
ConcreteDataType::boolean_datatype(),
],
)
.unwrap()
);
}
@@ -956,7 +969,15 @@ mod test {
.unwrap();
assert_eq!(ret, None);
let mut input1_batch = Batch::try_from_rows(vec![Row::new(input1)]).unwrap();
let input_type = [
ConcreteDataType::int32_datatype(),
ConcreteDataType::int32_datatype(),
ConcreteDataType::int32_datatype(),
ConcreteDataType::string_datatype(),
];
let mut input1_batch =
Batch::try_from_rows_with_types(vec![Row::new(input1)], &input_type).unwrap();
let ret_batch = safe_mfp.eval_batch_into(&mut input1_batch).unwrap();
assert_eq!(
ret_batch,
@@ -974,7 +995,8 @@ mod test {
.unwrap();
assert_eq!(ret, Some(Row::pack(vec![Value::from(11)])));
let mut input2_batch = Batch::try_from_rows(vec![Row::new(input2)]).unwrap();
let mut input2_batch =
Batch::try_from_rows_with_types(vec![Row::new(input2)], &input_type).unwrap();
let ret_batch = safe_mfp.eval_batch_into(&mut input2_batch).unwrap();
assert_eq!(
ret_batch,
@@ -1027,7 +1049,14 @@ mod test {
let ret = safe_mfp.evaluate_into(&mut input1.clone(), &mut Row::empty());
assert!(matches!(ret, Err(EvalError::InvalidArgument { .. })));
let mut input1_batch = Batch::try_from_rows(vec![Row::new(input1)]).unwrap();
let input_type = [
ConcreteDataType::int64_datatype(),
ConcreteDataType::int32_datatype(),
ConcreteDataType::int32_datatype(),
ConcreteDataType::int32_datatype(),
];
let mut input1_batch =
Batch::try_from_rows_with_types(vec![Row::new(input1)], &input_type).unwrap();
let ret_batch = safe_mfp.eval_batch_into(&mut input1_batch);
assert!(matches!(ret_batch, Err(EvalError::InvalidArgument { .. })));
@@ -1037,7 +1066,13 @@ mod test {
.unwrap();
assert_eq!(ret, Some(Row::new(input2.clone())));
let input2_batch = Batch::try_from_rows(vec![Row::new(input2)]).unwrap();
let input_type = [
ConcreteDataType::int64_datatype(),
ConcreteDataType::int32_datatype(),
ConcreteDataType::int32_datatype(),
];
let input2_batch =
Batch::try_from_rows_with_types(vec![Row::new(input2)], &input_type).unwrap();
let ret_batch = safe_mfp.eval_batch_into(&mut input2_batch.clone()).unwrap();
assert_eq!(ret_batch, input2_batch);
@@ -1047,7 +1082,8 @@ mod test {
.unwrap();
assert_eq!(ret, None);
let input3_batch = Batch::try_from_rows(vec![Row::new(input3)]).unwrap();
let input3_batch =
Batch::try_from_rows_with_types(vec![Row::new(input3)], &input_type).unwrap();
let ret_batch = safe_mfp.eval_batch_into(&mut input3_batch.clone()).unwrap();
assert_eq!(
ret_batch,
@@ -1083,7 +1119,13 @@ mod test {
let ret = safe_mfp.evaluate_into(&mut input1.clone(), &mut Row::empty());
assert_eq!(ret.unwrap(), Some(Row::new(vec![Value::from(false)])));
let mut input1_batch = Batch::try_from_rows(vec![Row::new(input1)]).unwrap();
let input_type = [
ConcreteDataType::int32_datatype(),
ConcreteDataType::int32_datatype(),
ConcreteDataType::int32_datatype(),
];
let mut input1_batch =
Batch::try_from_rows_with_types(vec![Row::new(input1)], &input_type).unwrap();
let ret_batch = safe_mfp.eval_batch_into(&mut input1_batch).unwrap();
assert_eq!(

View File

@@ -24,6 +24,7 @@ use common_meta::heartbeat::handler::{
};
use common_meta::heartbeat::mailbox::{HeartbeatMailbox, MailboxRef, OutgoingMessage};
use common_meta::heartbeat::utils::outgoing_message_to_mailbox_message;
use common_meta::key::flow::flow_state::FlowStat;
use common_telemetry::{debug, error, info, warn};
use greptime_proto::v1::meta::NodeInfo;
use meta_client::client::{HeartbeatSender, HeartbeatStream, MetaClient};
@@ -34,8 +35,27 @@ use tokio::sync::mpsc;
use tokio::time::Duration;
use crate::error::ExternalSnafu;
use crate::utils::SizeReportSender;
use crate::{Error, FlownodeOptions};
async fn query_flow_state(
query_stat_size: &Option<SizeReportSender>,
timeout: Duration,
) -> Option<FlowStat> {
if let Some(report_requester) = query_stat_size.as_ref() {
let ret = report_requester.query(timeout).await;
match ret {
Ok(latest) => Some(latest),
Err(err) => {
error!(err; "Failed to get query stat size");
None
}
}
} else {
None
}
}
/// The flownode heartbeat task, which sends [`HeartbeatRequest`]s to Metasrv periodically in the background.
#[derive(Clone)]
pub struct HeartbeatTask {
@@ -47,9 +67,14 @@ pub struct HeartbeatTask {
resp_handler_executor: HeartbeatResponseHandlerExecutorRef,
start_time_ms: u64,
running: Arc<AtomicBool>,
query_stat_size: Option<SizeReportSender>,
}
impl HeartbeatTask {
pub fn with_query_stat_size(mut self, query_stat_size: SizeReportSender) -> Self {
self.query_stat_size = Some(query_stat_size);
self
}
pub fn new(
opts: &FlownodeOptions,
meta_client: Arc<MetaClient>,
@@ -65,6 +90,7 @@ impl HeartbeatTask {
resp_handler_executor,
start_time_ms: common_time::util::current_time_millis() as u64,
running: Arc::new(AtomicBool::new(false)),
query_stat_size: None,
}
}
@@ -112,6 +138,7 @@ impl HeartbeatTask {
message: Option<OutgoingMessage>,
peer: Option<Peer>,
start_time_ms: u64,
latest_report: &Option<FlowStat>,
) -> Option<HeartbeatRequest> {
let mailbox_message = match message.map(outgoing_message_to_mailbox_message) {
Some(Ok(message)) => Some(message),
@@ -121,11 +148,22 @@ impl HeartbeatTask {
}
None => None,
};
let flow_stat = latest_report
.as_ref()
.map(|report| {
report
.state_size
.iter()
.map(|(k, v)| (*k, *v as u64))
.collect()
})
.map(|f| api::v1::meta::FlowStat { flow_stat_size: f });
Some(HeartbeatRequest {
mailbox_message,
peer,
info: Self::build_node_info(start_time_ms),
flow_stat,
..Default::default()
})
}
@@ -151,24 +189,27 @@ impl HeartbeatTask {
addr: self.peer_addr.clone(),
});
let query_stat_size = self.query_stat_size.clone();
common_runtime::spawn_hb(async move {
// note that using interval will cause it to first immediately send
// a heartbeat
let mut interval = tokio::time::interval(report_interval);
interval.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Delay);
let mut latest_report = None;
loop {
let req = tokio::select! {
message = outgoing_rx.recv() => {
if let Some(message) = message {
Self::create_heartbeat_request(Some(message), self_peer.clone(), start_time_ms)
Self::create_heartbeat_request(Some(message), self_peer.clone(), start_time_ms, &latest_report)
} else {
// Receiving None means the Sender was dropped, so break out of the loop
break
}
}
_ = interval.tick() => {
Self::create_heartbeat_request(None, self_peer.clone(), start_time_ms)
Self::create_heartbeat_request(None, self_peer.clone(), start_time_ms, &latest_report)
}
};
@@ -180,6 +221,10 @@ impl HeartbeatTask {
debug!("Send a heartbeat request to metasrv, content: {:?}", req);
}
}
// after sending heartbeat, try to get the latest report
// TODO(discord9): consider a better place to update the size report
// set the timeout to half of the report interval so that it wouldn't delay heartbeat if something went horribly wrong
latest_report = query_flow_state(&query_stat_size, report_interval / 2).await;
}
});
}

View File

@@ -22,12 +22,14 @@ use api::v1::Row as ProtoRow;
use datatypes::data_type::ConcreteDataType;
use datatypes::types::cast;
use datatypes::value::Value;
use get_size2::GetSize;
use itertools::Itertools;
pub(crate) use relation::{ColumnType, Key, RelationDesc, RelationType};
use serde::{Deserialize, Serialize};
use snafu::ResultExt;
use crate::expr::error::{CastValueSnafu, EvalError, InvalidArgumentSnafu};
use crate::utils::get_value_heap_size;
/// System-wide Record count difference type. Useful for capture data change
///
@@ -105,6 +107,12 @@ pub struct Row {
pub inner: Vec<Value>,
}
impl GetSize for Row {
fn get_heap_size(&self) -> usize {
self.inner.iter().map(get_value_heap_size).sum()
}
}
impl Row {
/// Create an empty row
pub fn empty() -> Self {

View File

@@ -55,6 +55,7 @@ use crate::error::{
};
use crate::heartbeat::HeartbeatTask;
use crate::transform::register_function_to_query_engine;
use crate::utils::{SizeReportSender, StateReportHandler};
use crate::{Error, FlowWorkerManager, FlownodeOptions};
pub const FLOW_NODE_SERVER_NAME: &str = "FLOW_NODE_SERVER";
@@ -236,6 +237,8 @@ pub struct FlownodeBuilder {
catalog_manager: CatalogManagerRef,
flow_metadata_manager: FlowMetadataManagerRef,
heartbeat_task: Option<HeartbeatTask>,
/// Receives a oneshot sender used to send the state size report back
state_report_handler: Option<StateReportHandler>,
}
impl FlownodeBuilder {
@@ -254,17 +257,20 @@ impl FlownodeBuilder {
catalog_manager,
flow_metadata_manager,
heartbeat_task: None,
state_report_handler: None,
}
}
pub fn with_heartbeat_task(self, heartbeat_task: HeartbeatTask) -> Self {
let (sender, receiver) = SizeReportSender::new();
Self {
heartbeat_task: Some(heartbeat_task),
heartbeat_task: Some(heartbeat_task.with_query_stat_size(sender)),
state_report_handler: Some(receiver),
..self
}
}
pub async fn build(self) -> Result<FlownodeInstance, Error> {
pub async fn build(mut self) -> Result<FlownodeInstance, Error> {
// TODO(discord9): does this query engine need those?
let query_engine_factory = QueryEngineFactory::new_with_plugins(
// query engine in flownode is only used for translate plan with resolved table source.
@@ -383,7 +389,7 @@ impl FlownodeBuilder {
/// build [`FlowWorkerManager`], note this doesn't take ownership of `self`,
/// nor does it actually start running the worker.
async fn build_manager(
&self,
&mut self,
query_engine: Arc<dyn QueryEngine>,
) -> Result<FlowWorkerManager, Error> {
let table_meta = self.table_meta.clone();
@@ -402,12 +408,15 @@ impl FlownodeBuilder {
info!("Flow Worker started in new thread");
worker.run();
});
let man = rx.await.map_err(|_e| {
let mut man = rx.await.map_err(|_e| {
UnexpectedSnafu {
reason: "sender is dropped, failed to create flow node manager",
}
.build()
})?;
if let Some(handler) = self.state_report_handler.take() {
man = man.with_state_report_handler(handler).await;
}
info!("Flow Node Manager started");
Ok(man)
}

View File

@@ -18,16 +18,73 @@ use std::collections::{BTreeMap, BTreeSet};
use std::ops::Bound;
use std::sync::Arc;
use common_meta::key::flow::flow_state::FlowStat;
use common_telemetry::trace;
use datatypes::value::Value;
use get_size2::GetSize;
use smallvec::{smallvec, SmallVec};
use tokio::sync::RwLock;
use tokio::sync::{mpsc, oneshot, RwLock};
use tokio::time::Instant;
use crate::error::InternalSnafu;
use crate::expr::{EvalError, ScalarExpr};
use crate::repr::{value_to_internal_ts, DiffRow, Duration, KeyValDiffRow, Row, Timestamp};
/// A batch of updates, arranged by key
pub type Batch = BTreeMap<Row, SmallVec<[DiffRow; 2]>>;
/// Get an estimate of the heap size of a value
pub fn get_value_heap_size(v: &Value) -> usize {
match v {
Value::Binary(bin) => bin.len(),
Value::String(s) => s.len(),
Value::List(list) => list.items().iter().map(get_value_heap_size).sum(),
_ => 0,
}
}
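// A quick sketch (hypothetical helper) of the estimate: only heap-allocated payloads
// are counted. The `Value` conversions below are assumptions about the datatypes crate.
fn value_heap_size_sketch() {
    assert_eq!(get_value_heap_size(&Value::from("abc")), 3); // string payload bytes
    assert_eq!(get_value_heap_size(&Value::from(42i64)), 0); // inline scalar, nothing on the heap
}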
#[derive(Clone)]
pub struct SizeReportSender {
inner: mpsc::Sender<oneshot::Sender<FlowStat>>,
}
impl SizeReportSender {
pub fn new() -> (Self, StateReportHandler) {
let (tx, rx) = mpsc::channel(1);
let zelf = Self { inner: tx };
(zelf, rx)
}
/// Query the size report, timing out after the given duration if there is no response
pub async fn query(&self, timeout: std::time::Duration) -> crate::Result<FlowStat> {
let (tx, rx) = oneshot::channel();
self.inner.send(tx).await.map_err(|_| {
InternalSnafu {
reason: "failed to send size report request due to receiver dropped",
}
.build()
})?;
let timeout = tokio::time::timeout(timeout, rx);
timeout
.await
.map_err(|_elapsed| {
InternalSnafu {
reason: "failed to receive size report after one second timeout",
}
.build()
})?
.map_err(|_| {
InternalSnafu {
reason: "failed to receive size report due to sender dropped",
}
.build()
})
}
}
/// Handle the size report request, and send the report back
pub type StateReportHandler = mpsc::Receiver<oneshot::Sender<FlowStat>>;
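// A minimal request/response sketch for the pair above, in a hypothetical helper
// (not part of the diff). The handler side normally lives in
// `FlowWorkerManager::start_state_report_handler`, the query side in the heartbeat task;
// the flow id and size are placeholders.
async fn size_report_sketch() -> crate::Result<()> {
    let (sender, mut handler) = SizeReportSender::new();
    // Responder: answer each request with the current report.
    common_runtime::spawn_global(async move {
        while let Some(reply_tx) = handler.recv().await {
            let report = FlowStat {
                state_size: BTreeMap::from([(1u32, 1024usize)]),
            };
            let _ = reply_tx.send(report);
        }
    });
    // Requester: ask for the latest report, giving up after the timeout.
    let _stat = sender.query(std::time::Duration::from_secs(1)).await?;
    Ok(())
}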
/// A spine of batches, arranged by timestamp
/// TODO(discord9): consider internally index by key, value, and timestamp for faster lookup
pub type Spine = BTreeMap<Timestamp, Batch>;
@@ -49,6 +106,24 @@ pub struct KeyExpiryManager {
event_timestamp_from_row: Option<ScalarExpr>,
}
impl GetSize for KeyExpiryManager {
fn get_heap_size(&self) -> usize {
let row_size = if let Some(row_size) = &self
.event_ts_to_key
.first_key_value()
.map(|(_, v)| v.first().get_heap_size())
{
*row_size
} else {
0
};
self.event_ts_to_key
.values()
.map(|v| v.len() * row_size + std::mem::size_of::<i64>())
.sum::<usize>()
}
}
impl KeyExpiryManager {
pub fn new(
key_expiration_duration: Option<Duration>,
@@ -154,7 +229,7 @@ impl KeyExpiryManager {
///
/// Note the two-way arrow between the reduce operator and arrange: the reduce operator needs to query existing state
/// and also needs to update it.
#[derive(Debug, Clone, Default, Eq, PartialEq, Ord, PartialOrd)]
#[derive(Debug, Clone, Eq, PartialEq, Ord, PartialOrd)]
pub struct Arrangement {
/// A name or identifier for the arrangement which can be used for debugging or logging purposes.
/// This field is not critical to the functionality but aids in monitoring and management of arrangements.
@@ -196,6 +271,61 @@ pub struct Arrangement {
/// The time that the last compaction happened, also known as the current time.
last_compaction_time: Option<Timestamp>,
/// Estimated heap size of the arrangement.
estimated_size: usize,
last_size_update: Instant,
size_update_interval: tokio::time::Duration,
}
impl Arrangement {
fn compute_size(&self) -> usize {
self.spine
.values()
.map(|v| {
let per_entry_size = v
.first_key_value()
.map(|(k, v)| {
k.get_heap_size()
+ v.len() * v.first().map(|r| r.get_heap_size()).unwrap_or(0)
})
.unwrap_or(0);
std::mem::size_of::<i64>() + v.len() * per_entry_size
})
.sum::<usize>()
+ self.expire_state.get_heap_size()
+ self.name.get_heap_size()
}
fn update_and_fetch_size(&mut self) -> usize {
if self.last_size_update.elapsed() > self.size_update_interval {
self.estimated_size = self.compute_size();
self.last_size_update = Instant::now();
}
self.estimated_size
}
}
impl GetSize for Arrangement {
fn get_heap_size(&self) -> usize {
self.estimated_size
}
}
impl Default for Arrangement {
fn default() -> Self {
Self {
spine: Default::default(),
full_arrangement: false,
is_written: false,
expire_state: None,
last_compaction_time: None,
name: Vec::new(),
estimated_size: 0,
last_size_update: Instant::now(),
size_update_interval: tokio::time::Duration::from_secs(3),
}
}
}
impl Arrangement {
@@ -207,6 +337,9 @@ impl Arrangement {
expire_state: None,
last_compaction_time: None,
name,
estimated_size: 0,
last_size_update: Instant::now(),
size_update_interval: tokio::time::Duration::from_secs(3),
}
}
@@ -269,6 +402,7 @@ impl Arrangement {
// without changing the order of updates within same tick
key_updates.sort_by_key(|(_val, ts, _diff)| *ts);
}
self.update_and_fetch_size();
Ok(max_expired_by)
}
@@ -390,6 +524,7 @@ impl Arrangement {
// insert the compacted batch into spine with key being `now`
self.spine.insert(now, compacting_batch);
self.update_and_fetch_size();
Ok(max_expired_by)
}

View File

@@ -41,6 +41,7 @@ datafusion-expr.workspace = true
datanode.workspace = true
humantime-serde.workspace = true
lazy_static.workspace = true
log-query.workspace = true
log-store.workspace = true
meta-client.workspace = true
opentelemetry-proto.workspace = true

View File

@@ -16,6 +16,7 @@ pub mod builder;
mod grpc;
mod influxdb;
mod log_handler;
mod logs;
mod opentsdb;
mod otlp;
mod prom_store;
@@ -64,8 +65,8 @@ use servers::prometheus_handler::PrometheusHandler;
use servers::query_handler::grpc::GrpcQueryHandler;
use servers::query_handler::sql::SqlQueryHandler;
use servers::query_handler::{
InfluxdbLineProtocolHandler, OpenTelemetryProtocolHandler, OpentsdbProtocolHandler,
PipelineHandler, PromStoreProtocolHandler, ScriptHandler,
InfluxdbLineProtocolHandler, LogQueryHandler, OpenTelemetryProtocolHandler,
OpentsdbProtocolHandler, PipelineHandler, PromStoreProtocolHandler, ScriptHandler,
};
use servers::server::ServerHandlers;
use session::context::QueryContextRef;
@@ -99,6 +100,7 @@ pub trait FrontendInstance:
+ ScriptHandler
+ PrometheusHandler
+ PipelineHandler
+ LogQueryHandler
+ Send
+ Sync
+ 'static

View File

@@ -0,0 +1,67 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
use auth::{PermissionChecker, PermissionCheckerRef, PermissionReq};
use client::Output;
use common_error::ext::BoxedError;
use log_query::LogQuery;
use server_error::Result as ServerResult;
use servers::error::{self as server_error, AuthSnafu, ExecuteQuerySnafu};
use servers::interceptor::{LogQueryInterceptor, LogQueryInterceptorRef};
use servers::query_handler::LogQueryHandler;
use session::context::QueryContextRef;
use snafu::ResultExt;
use tonic::async_trait;
use super::Instance;
#[async_trait]
impl LogQueryHandler for Instance {
async fn query(&self, mut request: LogQuery, ctx: QueryContextRef) -> ServerResult<Output> {
let interceptor = self
.plugins
.get::<LogQueryInterceptorRef<server_error::Error>>();
self.plugins
.get::<PermissionCheckerRef>()
.as_ref()
.check_permission(ctx.current_user(), PermissionReq::LogQuery)
.context(AuthSnafu)?;
interceptor.as_ref().pre_query(&request, ctx.clone())?;
request
.time_filter
.canonicalize()
.map_err(BoxedError::new)
.context(ExecuteQuerySnafu)?;
let plan = self
.query_engine
.planner()
.plan_logs_query(request, ctx.clone())
.await
.map_err(BoxedError::new)
.context(ExecuteQuerySnafu)?;
let output = self
.statement_executor
.exec_plan(plan, ctx.clone())
.await
.map_err(BoxedError::new)
.context(ExecuteQuerySnafu)?;
Ok(interceptor.as_ref().post_query(output, ctx.clone())?)
}
}

View File

@@ -87,6 +87,7 @@ where
let ingest_interceptor = self.plugins.get::<LogIngestInterceptorRef<ServerError>>();
builder =
builder.with_log_ingest_handler(self.instance.clone(), validator, ingest_interceptor);
builder = builder.with_logs_handler(self.instance.clone());
if let Some(user_provider) = self.plugins.get::<UserProviderRef>() {
builder = builder.with_user_provider(user_provider);

View File

@@ -15,11 +15,15 @@
use serde::{Deserialize, Serialize};
pub mod creator;
mod error;
pub mod error;
pub mod reader;
pub type Bytes = Vec<u8>;
pub type BytesRef<'a> = &'a [u8];
/// The seed used for the Bloom filter.
pub const SEED: u128 = 42;
/// The Meta information of the bloom filter stored in the file.
#[derive(Debug, Default, Serialize, Deserialize)]
pub struct BloomFilterMeta {

View File

@@ -12,21 +12,23 @@
// See the License for the specific language governing permissions and
// limitations under the License.
use std::collections::HashSet;
mod finalize_segment;
mod intermediate_codec;
use fastbloom::BloomFilter;
use futures::{AsyncWrite, AsyncWriteExt};
use std::collections::HashSet;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use finalize_segment::FinalizedBloomFilterStorage;
use futures::{AsyncWrite, AsyncWriteExt, StreamExt};
use snafu::ResultExt;
use super::error::{IoSnafu, SerdeJsonSnafu};
use crate::bloom_filter::error::Result;
use crate::bloom_filter::{BloomFilterMeta, BloomFilterSegmentLocation, Bytes};
/// The seed used for the Bloom filter.
const SEED: u128 = 42;
use crate::bloom_filter::error::{IoSnafu, Result, SerdeJsonSnafu};
use crate::bloom_filter::{BloomFilterMeta, BloomFilterSegmentLocation, Bytes, SEED};
use crate::external_provider::ExternalTempFileProvider;
/// The false positive rate of the Bloom filter.
const FALSE_POSITIVE_RATE: f64 = 0.01;
pub const FALSE_POSITIVE_RATE: f64 = 0.01;
/// `BloomFilterCreator` is responsible for creating and managing bloom filters
/// for a set of elements. It divides the rows into segments and creates
@@ -58,6 +60,9 @@ pub struct BloomFilterCreator {
/// Storage for finalized Bloom filters.
finalized_bloom_filters: FinalizedBloomFilterStorage,
/// Global memory usage of the bloom filter creator.
global_memory_usage: Arc<AtomicUsize>,
}
impl BloomFilterCreator {
@@ -66,7 +71,12 @@ impl BloomFilterCreator {
/// # PANICS
///
/// `rows_per_segment` <= 0
pub fn new(rows_per_segment: usize) -> Self {
pub fn new(
rows_per_segment: usize,
intermediate_provider: Box<dyn ExternalTempFileProvider>,
global_memory_usage: Arc<AtomicUsize>,
global_memory_usage_threshold: Option<usize>,
) -> Self {
assert!(
rows_per_segment > 0,
"rows_per_segment must be greater than 0"
@@ -77,54 +87,67 @@ impl BloomFilterCreator {
accumulated_row_count: 0,
cur_seg_distinct_elems: HashSet::default(),
cur_seg_distinct_elems_mem_usage: 0,
finalized_bloom_filters: FinalizedBloomFilterStorage::default(),
global_memory_usage: global_memory_usage.clone(),
finalized_bloom_filters: FinalizedBloomFilterStorage::new(
intermediate_provider,
global_memory_usage,
global_memory_usage_threshold,
),
}
}
/// Adds a row of elements to the bloom filter. If the number of accumulated rows
/// reaches `rows_per_segment`, it finalizes the current segment.
pub fn push_row_elems(&mut self, elems: impl IntoIterator<Item = Bytes>) {
pub async fn push_row_elems(&mut self, elems: impl IntoIterator<Item = Bytes>) -> Result<()> {
self.accumulated_row_count += 1;
let mut mem_diff = 0;
for elem in elems.into_iter() {
let len = elem.len();
let is_new = self.cur_seg_distinct_elems.insert(elem);
if is_new {
self.cur_seg_distinct_elems_mem_usage += len;
mem_diff += len;
}
}
self.cur_seg_distinct_elems_mem_usage += mem_diff;
self.global_memory_usage
.fetch_add(mem_diff, Ordering::Relaxed);
if self.accumulated_row_count % self.rows_per_segment == 0 {
self.finalize_segment();
self.finalize_segment().await?;
}
Ok(())
}
/// Finalizes any remaining segments and writes the bloom filters and metadata to the provided writer.
pub async fn finish(&mut self, mut writer: impl AsyncWrite + Unpin) -> Result<()> {
if !self.cur_seg_distinct_elems.is_empty() {
self.finalize_segment();
self.finalize_segment().await?;
}
let mut meta = BloomFilterMeta {
rows_per_segment: self.rows_per_segment,
seg_count: self.finalized_bloom_filters.len(),
row_count: self.accumulated_row_count,
..Default::default()
};
let mut buf = Vec::new();
for segment in self.finalized_bloom_filters.drain() {
let slice = segment.bloom_filter.as_slice();
buf.clear();
write_u64_slice(&mut buf, slice);
writer.write_all(&buf).await.context(IoSnafu)?;
let mut segs = self.finalized_bloom_filters.drain().await?;
while let Some(segment) = segs.next().await {
let segment = segment?;
writer
.write_all(&segment.bloom_filter_bytes)
.await
.context(IoSnafu)?;
let size = buf.len();
let size = segment.bloom_filter_bytes.len();
meta.bloom_filter_segments.push(BloomFilterSegmentLocation {
offset: meta.bloom_filter_segments_size as _,
size: size as _,
elem_count: segment.element_count,
});
meta.bloom_filter_segments_size += size;
meta.seg_count += 1;
}
let meta_bytes = serde_json::to_vec(&meta).context(SerdeJsonSnafu)?;
@@ -145,91 +168,29 @@ impl BloomFilterCreator {
self.cur_seg_distinct_elems_mem_usage + self.finalized_bloom_filters.memory_usage()
}
fn finalize_segment(&mut self) {
async fn finalize_segment(&mut self) -> Result<()> {
let elem_count = self.cur_seg_distinct_elems.len();
self.finalized_bloom_filters
.add(self.cur_seg_distinct_elems.drain(), elem_count);
.add(self.cur_seg_distinct_elems.drain(), elem_count)
.await?;
self.global_memory_usage
.fetch_sub(self.cur_seg_distinct_elems_mem_usage, Ordering::Relaxed);
self.cur_seg_distinct_elems_mem_usage = 0;
}
}
/// Storage for finalized Bloom filters.
///
/// TODO(zhongzc): Add support for storing intermediate bloom filters on disk to control memory usage.
#[derive(Debug, Default)]
struct FinalizedBloomFilterStorage {
/// Bloom filters that are stored in memory.
in_memory: Vec<FinalizedBloomFilterSegment>,
}
impl FinalizedBloomFilterStorage {
fn memory_usage(&self) -> usize {
self.in_memory.iter().map(|s| s.size).sum()
}
/// Adds a new finalized Bloom filter to the storage.
///
/// TODO(zhongzc): Add support for flushing to disk.
fn add(&mut self, elems: impl IntoIterator<Item = Bytes>, elem_count: usize) {
let mut bf = BloomFilter::with_false_pos(FALSE_POSITIVE_RATE)
.seed(&SEED)
.expected_items(elem_count);
for elem in elems.into_iter() {
bf.insert(&elem);
}
let cbf = FinalizedBloomFilterSegment::new(bf, elem_count);
self.in_memory.push(cbf);
}
fn len(&self) -> usize {
self.in_memory.len()
}
fn drain(&mut self) -> impl Iterator<Item = FinalizedBloomFilterSegment> + '_ {
self.in_memory.drain(..)
}
}
/// A finalized Bloom filter segment.
#[derive(Debug)]
struct FinalizedBloomFilterSegment {
/// The underlying Bloom filter.
bloom_filter: BloomFilter,
/// The number of elements in the Bloom filter.
element_count: usize,
/// The occupied memory size of the Bloom filter.
size: usize,
}
impl FinalizedBloomFilterSegment {
fn new(bloom_filter: BloomFilter, elem_count: usize) -> Self {
let memory_usage = std::mem::size_of_val(bloom_filter.as_slice());
Self {
bloom_filter,
element_count: elem_count,
size: memory_usage,
}
}
}
/// Writes a slice of `u64` to the buffer in little-endian order.
fn write_u64_slice(buf: &mut Vec<u8>, slice: &[u64]) {
buf.reserve(std::mem::size_of_val(slice));
for &x in slice {
buf.extend_from_slice(&x.to_le_bytes());
Ok(())
}
}
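// A hedged usage sketch of the now-async creator API in a hypothetical helper,
// mirroring the test below; the mock provider (a test double), sizes, and threshold
// are placeholders.
async fn bloom_filter_creator_sketch() -> Result<()> {
    use crate::external_provider::MockExternalTempFileProvider; // test double, assumed available here
    let mut creator = BloomFilterCreator::new(
        1024,                                          // rows_per_segment
        Box::new(MockExternalTempFileProvider::new()), // intermediate temp-file provider
        Arc::new(AtomicUsize::new(0)),                 // shared global memory counter
        Some(64 * 1024 * 1024),                        // optional global flush threshold
    );
    creator.push_row_elems([b"tag_a".to_vec()]).await?;
    let mut writer = futures::io::Cursor::new(Vec::new());
    creator.finish(&mut writer).await?;
    Ok(())
}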
#[cfg(test)]
mod tests {
use fastbloom::BloomFilter;
use futures::io::Cursor;
use super::*;
use crate::external_provider::MockExternalTempFileProvider;
fn u64_vec_from_bytes(bytes: &[u8]) -> Vec<u64> {
/// Converts a slice of bytes to a vector of `u64`.
pub fn u64_vec_from_bytes(bytes: &[u8]) -> Vec<u64> {
bytes
.chunks_exact(std::mem::size_of::<u64>())
.map(|chunk| u64::from_le_bytes(chunk.try_into().unwrap()))
@@ -239,18 +200,32 @@ mod tests {
#[tokio::test]
async fn test_bloom_filter_creator() {
let mut writer = Cursor::new(Vec::new());
let mut creator = BloomFilterCreator::new(2);
let mut creator = BloomFilterCreator::new(
2,
Box::new(MockExternalTempFileProvider::new()),
Arc::new(AtomicUsize::new(0)),
None,
);
creator.push_row_elems(vec![b"a".to_vec(), b"b".to_vec()]);
creator
.push_row_elems(vec![b"a".to_vec(), b"b".to_vec()])
.await
.unwrap();
assert!(creator.cur_seg_distinct_elems_mem_usage > 0);
assert!(creator.memory_usage() > 0);
creator.push_row_elems(vec![b"c".to_vec(), b"d".to_vec()]);
creator
.push_row_elems(vec![b"c".to_vec(), b"d".to_vec()])
.await
.unwrap();
// Finalize the first segment
assert!(creator.cur_seg_distinct_elems_mem_usage == 0);
assert_eq!(creator.cur_seg_distinct_elems_mem_usage, 0);
assert!(creator.memory_usage() > 0);
creator.push_row_elems(vec![b"e".to_vec(), b"f".to_vec()]);
creator
.push_row_elems(vec![b"e".to_vec(), b"f".to_vec()])
.await
.unwrap();
assert!(creator.cur_seg_distinct_elems_mem_usage > 0);
assert!(creator.memory_usage() > 0);

View File

@@ -0,0 +1,293 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use asynchronous_codec::{FramedRead, FramedWrite};
use fastbloom::BloomFilter;
use futures::stream::StreamExt;
use futures::{stream, AsyncWriteExt, Stream};
use snafu::ResultExt;
use super::intermediate_codec::IntermediateBloomFilterCodecV1;
use crate::bloom_filter::creator::{FALSE_POSITIVE_RATE, SEED};
use crate::bloom_filter::error::{IntermediateSnafu, IoSnafu, Result};
use crate::bloom_filter::Bytes;
use crate::external_provider::ExternalTempFileProvider;
/// The minimum memory usage threshold for flushing in-memory Bloom filters to disk.
const MIN_MEMORY_USAGE_THRESHOLD: usize = 1024 * 1024; // 1MB
/// Storage for finalized Bloom filters.
pub struct FinalizedBloomFilterStorage {
/// Bloom filters that are stored in memory.
in_memory: Vec<FinalizedBloomFilterSegment>,
/// Used to generate unique file IDs for intermediate Bloom filters.
intermediate_file_id_counter: usize,
/// Prefix for intermediate Bloom filter files.
intermediate_prefix: String,
/// The provider for intermediate Bloom filter files.
intermediate_provider: Box<dyn ExternalTempFileProvider>,
/// The memory usage of the in-memory Bloom filters.
memory_usage: usize,
/// The global memory usage provided by the user to track the
/// total memory usage of the creating Bloom filters.
global_memory_usage: Arc<AtomicUsize>,
/// The threshold of the global memory usage of the creating Bloom filters.
global_memory_usage_threshold: Option<usize>,
}
impl FinalizedBloomFilterStorage {
/// Creates a new `FinalizedBloomFilterStorage`.
pub fn new(
intermediate_provider: Box<dyn ExternalTempFileProvider>,
global_memory_usage: Arc<AtomicUsize>,
global_memory_usage_threshold: Option<usize>,
) -> Self {
let external_prefix = format!("intm-bloom-filters-{}", uuid::Uuid::new_v4());
Self {
in_memory: Vec::new(),
intermediate_file_id_counter: 0,
intermediate_prefix: external_prefix,
intermediate_provider,
memory_usage: 0,
global_memory_usage,
global_memory_usage_threshold,
}
}
/// Returns the memory usage of the storage.
pub fn memory_usage(&self) -> usize {
self.memory_usage
}
/// Adds a new finalized Bloom filter to the storage.
///
/// If the memory usage exceeds the threshold, flushes the in-memory Bloom filters to disk.
pub async fn add(
&mut self,
elems: impl IntoIterator<Item = Bytes>,
element_count: usize,
) -> Result<()> {
let mut bf = BloomFilter::with_false_pos(FALSE_POSITIVE_RATE)
.seed(&SEED)
.expected_items(element_count);
for elem in elems.into_iter() {
bf.insert(&elem);
}
let fbf = FinalizedBloomFilterSegment::from(bf, element_count);
// Update memory usage.
let memory_diff = fbf.bloom_filter_bytes.len();
self.memory_usage += memory_diff;
self.global_memory_usage
.fetch_add(memory_diff, Ordering::Relaxed);
// Add the finalized Bloom filter to the in-memory storage.
self.in_memory.push(fbf);
// Flush to disk if necessary.
// Do not flush if memory usage is too low.
if self.memory_usage < MIN_MEMORY_USAGE_THRESHOLD {
return Ok(());
}
// Check if the global memory usage exceeds the threshold and flush to disk if necessary.
if let Some(threshold) = self.global_memory_usage_threshold {
let global = self.global_memory_usage.load(Ordering::Relaxed);
if global > threshold {
self.flush_in_memory_to_disk().await?;
self.global_memory_usage
.fetch_sub(self.memory_usage, Ordering::Relaxed);
self.memory_usage = 0;
}
}
Ok(())
}
/// Drains the storage and returns a stream of finalized Bloom filter segments.
pub async fn drain(
&mut self,
) -> Result<Pin<Box<dyn Stream<Item = Result<FinalizedBloomFilterSegment>> + '_>>> {
// FAST PATH: memory only
if self.intermediate_file_id_counter == 0 {
return Ok(Box::pin(stream::iter(self.in_memory.drain(..).map(Ok))));
}
// SLOW PATH: memory + disk
let mut on_disk = self
.intermediate_provider
.read_all(&self.intermediate_prefix)
.await
.context(IntermediateSnafu)?;
on_disk.sort_unstable_by(|x, y| x.0.cmp(&y.0));
let streams = on_disk
.into_iter()
.map(|(_, reader)| FramedRead::new(reader, IntermediateBloomFilterCodecV1::default()));
let in_memory_stream = stream::iter(self.in_memory.drain(..)).map(Ok);
Ok(Box::pin(
stream::iter(streams).flatten().chain(in_memory_stream),
))
}
/// Flushes the in-memory Bloom filters to disk.
async fn flush_in_memory_to_disk(&mut self) -> Result<()> {
let file_id = self.intermediate_file_id_counter;
self.intermediate_file_id_counter += 1;
let file_id = format!("{:08}", file_id);
let mut writer = self
.intermediate_provider
.create(&self.intermediate_prefix, &file_id)
.await
.context(IntermediateSnafu)?;
let fw = FramedWrite::new(&mut writer, IntermediateBloomFilterCodecV1::default());
// `forward()` will flush and close the writer when the stream ends
if let Err(e) = stream::iter(self.in_memory.drain(..).map(Ok))
.forward(fw)
.await
{
writer.close().await.context(IoSnafu)?;
writer.flush().await.context(IoSnafu)?;
return Err(e);
}
Ok(())
}
}
/// A finalized Bloom filter segment.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FinalizedBloomFilterSegment {
/// The underlying Bloom filter bytes.
pub bloom_filter_bytes: Vec<u8>,
/// The number of elements in the Bloom filter.
pub element_count: usize,
}
impl FinalizedBloomFilterSegment {
fn from(bf: BloomFilter, elem_count: usize) -> Self {
let bf_slice = bf.as_slice();
let mut bloom_filter_bytes = Vec::with_capacity(std::mem::size_of_val(bf_slice));
for &x in bf_slice {
bloom_filter_bytes.extend_from_slice(&x.to_le_bytes());
}
Self {
bloom_filter_bytes,
element_count: elem_count,
}
}
}
#[cfg(test)]
mod tests {
use std::collections::HashMap;
use std::sync::Mutex;
use futures::AsyncRead;
use tokio::io::duplex;
use tokio_util::compat::{TokioAsyncReadCompatExt, TokioAsyncWriteCompatExt};
use super::*;
use crate::bloom_filter::creator::tests::u64_vec_from_bytes;
use crate::external_provider::MockExternalTempFileProvider;
#[tokio::test]
async fn test_finalized_bloom_filter_storage() {
let mut mock_provider = MockExternalTempFileProvider::new();
let mock_files: Arc<Mutex<HashMap<String, Box<dyn AsyncRead + Unpin + Send>>>> =
Arc::new(Mutex::new(HashMap::new()));
mock_provider.expect_create().returning({
let files = Arc::clone(&mock_files);
move |file_group, file_id| {
assert!(file_group.starts_with("intm-bloom-filters-"));
let mut files = files.lock().unwrap();
let (writer, reader) = duplex(2 * 1024 * 1024);
files.insert(file_id.to_string(), Box::new(reader.compat()));
Ok(Box::new(writer.compat_write()))
}
});
mock_provider.expect_read_all().returning({
let files = Arc::clone(&mock_files);
move |file_group| {
assert!(file_group.starts_with("intm-bloom-filters-"));
let mut files = files.lock().unwrap();
Ok(files.drain().collect::<Vec<_>>())
}
});
let global_memory_usage = Arc::new(AtomicUsize::new(0));
let global_memory_usage_threshold = Some(1024 * 1024); // 1MB
let provider = Box::new(mock_provider);
let mut storage = FinalizedBloomFilterStorage::new(
provider,
global_memory_usage.clone(),
global_memory_usage_threshold,
);
let elem_count = 2000;
let batch = 1000;
for i in 0..batch {
let elems = (elem_count * i..elem_count * (i + 1)).map(|x| x.to_string().into_bytes());
storage.add(elems, elem_count).await.unwrap();
}
// Flush happens.
assert!(storage.intermediate_file_id_counter > 0);
// Drain the storage.
let mut stream = storage.drain().await.unwrap();
let mut i = 0;
while let Some(segment) = stream.next().await {
let segment = segment.unwrap();
assert_eq!(segment.element_count, elem_count);
let v = u64_vec_from_bytes(&segment.bloom_filter_bytes);
// Check the correctness of the Bloom filter.
let bf = BloomFilter::from_vec(v)
.seed(&SEED)
.expected_items(segment.element_count);
for elem in (elem_count * i..elem_count * (i + 1)).map(|x| x.to_string().into_bytes()) {
assert!(bf.contains(&elem));
}
i += 1;
}
assert_eq!(i, batch);
}
}

View File

@@ -0,0 +1,248 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
use asynchronous_codec::{BytesMut, Decoder, Encoder};
use bytes::{Buf, BufMut};
use snafu::{ensure, ResultExt};
use crate::bloom_filter::creator::finalize_segment::FinalizedBloomFilterSegment;
use crate::bloom_filter::error::{Error, InvalidIntermediateMagicSnafu, IoSnafu, Result};
/// The magic number for the codec version 1 of the intermediate bloom filter.
const CODEC_V1_MAGIC: &[u8; 4] = b"bi01";
/// Codec of the intermediate finalized bloom filter segment.
///
/// # Format
///
/// [ magic ][ elem count ][ size ][ bloom filter ][ elem count ][ size ][ bloom filter ]...
/// [4] [8] [8] [size] [8] [8] [size]
#[derive(Debug, Default)]
pub struct IntermediateBloomFilterCodecV1 {
handled_header_magic: bool,
}
impl Encoder for IntermediateBloomFilterCodecV1 {
type Item<'a> = FinalizedBloomFilterSegment;
type Error = Error;
fn encode(&mut self, item: FinalizedBloomFilterSegment, dst: &mut BytesMut) -> Result<()> {
if !self.handled_header_magic {
dst.extend_from_slice(CODEC_V1_MAGIC);
self.handled_header_magic = true;
}
let segment_bytes = item.bloom_filter_bytes;
let elem_count = item.element_count;
dst.reserve(2 * std::mem::size_of::<u64>() + segment_bytes.len());
dst.put_u64_le(elem_count as u64);
dst.put_u64_le(segment_bytes.len() as u64);
dst.extend_from_slice(&segment_bytes);
Ok(())
}
}
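// --- Illustration (not part of the crate): the wire layout one encoded frame ends up with.
// For a segment with element_count = 2 and bloom_filter_bytes = [1, 2, 3, 4], the encoder above
// emits: [ "bi01" ][ 02 00 00 00 00 00 00 00 ][ 04 00 00 00 00 00 00 00 ][ 01 02 03 04 ].
// The helper below is hypothetical and only spells that out with assertions.
fn frame_layout_sketch() {
    let mut encoder = IntermediateBloomFilterCodecV1::default();
    let mut buf = BytesMut::new();
    let segment = FinalizedBloomFilterSegment {
        element_count: 2,
        bloom_filter_bytes: vec![1, 2, 3, 4],
    };
    encoder.encode(segment, &mut buf).unwrap();
    assert_eq!(&buf[..4], b"bi01".as_slice()); // CODEC_V1_MAGIC, written once per stream
    assert_eq!(&buf[4..12], 2u64.to_le_bytes().as_slice()); // element count, u64 LE
    assert_eq!(&buf[12..20], 4u64.to_le_bytes().as_slice()); // bloom filter size, u64 LE
    assert_eq!(&buf[20..], [1u8, 2, 3, 4].as_slice()); // bloom filter bytes
}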
impl Decoder for IntermediateBloomFilterCodecV1 {
type Item = FinalizedBloomFilterSegment;
type Error = Error;
fn decode(&mut self, src: &mut BytesMut) -> Result<Option<Self::Item>> {
if !self.handled_header_magic {
let m_len = CODEC_V1_MAGIC.len();
if src.remaining() < m_len {
return Ok(None);
}
let magic_bytes = &src[..m_len];
ensure!(
magic_bytes == CODEC_V1_MAGIC,
InvalidIntermediateMagicSnafu {
invalid: magic_bytes,
}
);
self.handled_header_magic = true;
src.advance(m_len);
}
let s = &src[..];
let u64_size = std::mem::size_of::<u64>();
let n_size = u64_size * 2;
if s.len() < n_size {
return Ok(None);
}
let element_count = u64::from_le_bytes(s[0..u64_size].try_into().unwrap()) as usize;
let segment_size = u64::from_le_bytes(s[u64_size..n_size].try_into().unwrap()) as usize;
if s.len() < n_size + segment_size {
return Ok(None);
}
let bloom_filter_bytes = s[n_size..n_size + segment_size].to_vec();
src.advance(n_size + segment_size);
Ok(Some(FinalizedBloomFilterSegment {
element_count,
bloom_filter_bytes,
}))
}
}
/// Required for [`Encoder`] and [`Decoder`] implementations.
impl From<std::io::Error> for Error {
fn from(error: std::io::Error) -> Self {
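// Route the io::Error through a `Result` so snafu attaches the `IoSnafu` context before
// unwrapping the converted error back out.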
Err::<(), std::io::Error>(error)
.context(IoSnafu)
.unwrap_err()
}
}
#[cfg(test)]
mod tests {
use asynchronous_codec::{FramedRead, FramedWrite};
use futures::io::Cursor;
use futures::{SinkExt, StreamExt};
use super::*;
use crate::bloom_filter::creator::finalize_segment::FinalizedBloomFilterSegment;
#[test]
fn test_intermediate_bloom_filter_codec_v1_basic() {
let mut encoder = IntermediateBloomFilterCodecV1::default();
let mut buf = BytesMut::new();
let item1 = FinalizedBloomFilterSegment {
element_count: 2,
bloom_filter_bytes: vec![1, 2, 3, 4],
};
let item2 = FinalizedBloomFilterSegment {
element_count: 3,
bloom_filter_bytes: vec![5, 6, 7, 8],
};
let item3 = FinalizedBloomFilterSegment {
element_count: 4,
bloom_filter_bytes: vec![9, 10, 11, 12],
};
encoder.encode(item1.clone(), &mut buf).unwrap();
encoder.encode(item2.clone(), &mut buf).unwrap();
encoder.encode(item3.clone(), &mut buf).unwrap();
let mut buf = buf.freeze().try_into_mut().unwrap();
let mut decoder = IntermediateBloomFilterCodecV1::default();
let decoded_item1 = decoder.decode(&mut buf).unwrap().unwrap();
let decoded_item2 = decoder.decode(&mut buf).unwrap().unwrap();
let decoded_item3 = decoder.decode(&mut buf).unwrap().unwrap();
assert_eq!(item1, decoded_item1);
assert_eq!(item2, decoded_item2);
assert_eq!(item3, decoded_item3);
}
#[tokio::test]
async fn test_intermediate_bloom_filter_codec_v1_frame_read_write() {
let item1 = FinalizedBloomFilterSegment {
element_count: 2,
bloom_filter_bytes: vec![1, 2, 3, 4],
};
let item2 = FinalizedBloomFilterSegment {
element_count: 3,
bloom_filter_bytes: vec![5, 6, 7, 8],
};
let item3 = FinalizedBloomFilterSegment {
element_count: 4,
bloom_filter_bytes: vec![9, 10, 11, 12],
};
let mut bytes = Cursor::new(vec![]);
let mut writer = FramedWrite::new(&mut bytes, IntermediateBloomFilterCodecV1::default());
writer.send(item1.clone()).await.unwrap();
writer.send(item2.clone()).await.unwrap();
writer.send(item3.clone()).await.unwrap();
writer.flush().await.unwrap();
writer.close().await.unwrap();
let bytes = bytes.into_inner();
let mut reader =
FramedRead::new(bytes.as_slice(), IntermediateBloomFilterCodecV1::default());
let decoded_item1 = reader.next().await.unwrap().unwrap();
let decoded_item2 = reader.next().await.unwrap().unwrap();
let decoded_item3 = reader.next().await.unwrap().unwrap();
assert!(reader.next().await.is_none());
assert_eq!(item1, decoded_item1);
assert_eq!(item2, decoded_item2);
assert_eq!(item3, decoded_item3);
}
#[tokio::test]
async fn test_intermediate_bloom_filter_codec_v1_frame_read_write_only_magic() {
let bytes = CODEC_V1_MAGIC.to_vec();
let mut reader =
FramedRead::new(bytes.as_slice(), IntermediateBloomFilterCodecV1::default());
assert!(reader.next().await.is_none());
}
#[tokio::test]
async fn test_intermediate_bloom_filter_codec_v1_frame_read_write_partial_magic() {
let bytes = CODEC_V1_MAGIC[..3].to_vec();
let mut reader =
FramedRead::new(bytes.as_slice(), IntermediateBloomFilterCodecV1::default());
let e = reader.next().await.unwrap();
assert!(e.is_err());
}
#[tokio::test]
async fn test_intermediate_bloom_filter_codec_v1_frame_read_write_partial_item() {
let mut bytes = vec![];
bytes.extend_from_slice(CODEC_V1_MAGIC);
bytes.extend_from_slice(&2u64.to_le_bytes());
bytes.extend_from_slice(&4u64.to_le_bytes());
let mut reader =
FramedRead::new(bytes.as_slice(), IntermediateBloomFilterCodecV1::default());
let e = reader.next().await.unwrap();
assert!(e.is_err());
}
#[tokio::test]
async fn test_intermediate_bloom_filter_codec_v1_frame_read_write_corrupted_magic() {
let mut bytes = vec![];
bytes.extend_from_slice(b"bi02");
bytes.extend_from_slice(&2u64.to_le_bytes());
bytes.extend_from_slice(&4u64.to_le_bytes());
bytes.extend_from_slice(&[1, 2, 3, 4]);
let mut reader =
FramedRead::new(bytes.as_slice(), IntermediateBloomFilterCodecV1::default());
let e = reader.next().await.unwrap();
assert!(e.is_err());
}
#[tokio::test]
async fn test_intermediate_bloom_filter_codec_v1_frame_read_write_corrupted_length() {
let mut bytes = vec![];
bytes.extend_from_slice(CODEC_V1_MAGIC);
bytes.extend_from_slice(&2u64.to_le_bytes());
bytes.extend_from_slice(&4u64.to_le_bytes());
bytes.extend_from_slice(&[1, 2, 3]);
let mut reader =
FramedRead::new(bytes.as_slice(), IntermediateBloomFilterCodecV1::default());
let e = reader.next().await.unwrap();
assert!(e.is_err());
}
}

View File

@@ -39,6 +39,43 @@ pub enum Error {
location: Location,
},
#[snafu(display("Failed to deserialize json"))]
DeserializeJson {
#[snafu(source)]
error: serde_json::Error,
#[snafu(implicit)]
location: Location,
},
#[snafu(display("Intermediate error"))]
Intermediate {
source: crate::error::Error,
#[snafu(implicit)]
location: Location,
},
#[snafu(display("File size too small for bloom filter"))]
FileSizeTooSmall {
size: u64,
#[snafu(implicit)]
location: Location,
},
#[snafu(display("Unexpected bloom filter meta size"))]
UnexpectedMetaSize {
max_meta_size: u64,
actual_meta_size: u64,
#[snafu(implicit)]
location: Location,
},
#[snafu(display("Invalid intermediate magic"))]
InvalidIntermediateMagic {
invalid: Vec<u8>,
#[snafu(implicit)]
location: Location,
},
#[snafu(display("External error"))]
External {
source: BoxedError,
@@ -52,8 +89,14 @@ impl ErrorExt for Error {
use Error::*;
match self {
Io { .. } | Self::SerdeJson { .. } => StatusCode::Unexpected,
Io { .. }
| SerdeJson { .. }
| FileSizeTooSmall { .. }
| UnexpectedMetaSize { .. }
| DeserializeJson { .. }
| InvalidIntermediateMagic { .. } => StatusCode::Unexpected,
Intermediate { source, .. } => source.status_code(),
External { source, .. } => source.status_code(),
}
}

View File

@@ -0,0 +1,265 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
use std::ops::Range;
use async_trait::async_trait;
use bytes::Bytes;
use common_base::range_read::RangeReader;
use fastbloom::BloomFilter;
use snafu::{ensure, ResultExt};
use crate::bloom_filter::error::{
DeserializeJsonSnafu, FileSizeTooSmallSnafu, IoSnafu, Result, UnexpectedMetaSizeSnafu,
};
use crate::bloom_filter::{BloomFilterMeta, BloomFilterSegmentLocation, SEED};
/// Size of the trailing metadata-length field, which is also the minimum valid size of a bloom filter file.
const BLOOM_META_LEN_SIZE: u64 = 4;
/// Default prefetch size of bloom filter meta.
pub const DEFAULT_PREFETCH_SIZE: u64 = 1024; // 1KiB
/// `BloomFilterReader` reads the bloom filter from the file.
#[async_trait]
pub trait BloomFilterReader {
/// Reads a range of bytes from the file.
async fn range_read(&mut self, offset: u64, size: u32) -> Result<Bytes>;
/// Reads multiple ranges of bytes from the file.
async fn read_vec(&mut self, ranges: &[Range<u64>]) -> Result<Vec<Bytes>>;
/// Reads the meta information of the bloom filter.
async fn metadata(&mut self) -> Result<BloomFilterMeta>;
/// Reads a bloom filter with the given location.
async fn bloom_filter(&mut self, loc: &BloomFilterSegmentLocation) -> Result<BloomFilter> {
let bytes = self.range_read(loc.offset, loc.size as _).await?;
let vec = bytes
.chunks_exact(std::mem::size_of::<u64>())
.map(|chunk| u64::from_le_bytes(chunk.try_into().unwrap()))
.collect();
let bm = BloomFilter::from_vec(vec)
.seed(&SEED)
.expected_items(loc.elem_count);
Ok(bm)
}
}
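// --- Illustration (not part of the crate): the shared little-endian u64 contract between the
// creator's `FinalizedBloomFilterSegment::from` and the default `bloom_filter` method above.
// Helper names are made up.
fn words_to_bytes(words: &[u64]) -> Vec<u8> {
    words.iter().flat_map(|w| w.to_le_bytes()).collect()
}
fn bytes_to_words(bytes: &[u8]) -> Vec<u64> {
    bytes
        .chunks_exact(std::mem::size_of::<u64>())
        .map(|chunk| u64::from_le_bytes(chunk.try_into().unwrap()))
        .collect()
}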
/// `BloomFilterReaderImpl` reads the bloom filter from the file.
pub struct BloomFilterReaderImpl<R: RangeReader> {
/// The underlying reader.
reader: R,
}
impl<R: RangeReader> BloomFilterReaderImpl<R> {
/// Creates a new `BloomFilterReaderImpl` with the given reader.
pub fn new(reader: R) -> Self {
Self { reader }
}
}
#[async_trait]
impl<R: RangeReader> BloomFilterReader for BloomFilterReaderImpl<R> {
async fn range_read(&mut self, offset: u64, size: u32) -> Result<Bytes> {
self.reader
.read(offset..offset + size as u64)
.await
.context(IoSnafu)
}
async fn read_vec(&mut self, ranges: &[Range<u64>]) -> Result<Vec<Bytes>> {
self.reader.read_vec(ranges).await.context(IoSnafu)
}
async fn metadata(&mut self) -> Result<BloomFilterMeta> {
let metadata = self.reader.metadata().await.context(IoSnafu)?;
let file_size = metadata.content_length;
let mut meta_reader =
BloomFilterMetaReader::new(&mut self.reader, file_size, Some(DEFAULT_PREFETCH_SIZE));
meta_reader.metadata().await
}
}
/// `BloomFilterMetaReader` reads the metadata of the bloom filter.
struct BloomFilterMetaReader<R: RangeReader> {
reader: R,
file_size: u64,
prefetch_size: u64,
}
impl<R: RangeReader> BloomFilterMetaReader<R> {
pub fn new(reader: R, file_size: u64, prefetch_size: Option<u64>) -> Self {
Self {
reader,
file_size,
prefetch_size: prefetch_size
.unwrap_or(BLOOM_META_LEN_SIZE)
.max(BLOOM_META_LEN_SIZE),
}
}
/// Reads the metadata of the bloom filter.
///
/// It first prefetches some bytes from the end of the file,
/// then parses the metadata from the prefetched bytes.
pub async fn metadata(&mut self) -> Result<BloomFilterMeta> {
ensure!(
self.file_size >= BLOOM_META_LEN_SIZE,
FileSizeTooSmallSnafu {
size: self.file_size,
}
);
let meta_start = self.file_size.saturating_sub(self.prefetch_size);
let suffix = self
.reader
.read(meta_start..self.file_size)
.await
.context(IoSnafu)?;
let suffix_len = suffix.len();
let length = u32::from_le_bytes(Self::read_tailing_four_bytes(&suffix)?) as u64;
self.validate_meta_size(length)?;
if length > suffix_len as u64 - BLOOM_META_LEN_SIZE {
let metadata_start = self.file_size - length - BLOOM_META_LEN_SIZE;
let meta = self
.reader
.read(metadata_start..self.file_size - BLOOM_META_LEN_SIZE)
.await
.context(IoSnafu)?;
serde_json::from_slice(&meta).context(DeserializeJsonSnafu)
} else {
let metadata_start = self.file_size - length - BLOOM_META_LEN_SIZE - meta_start;
let meta = &suffix[metadata_start as usize..suffix_len - BLOOM_META_LEN_SIZE as usize];
serde_json::from_slice(meta).context(DeserializeJsonSnafu)
}
}
fn read_tailing_four_bytes(suffix: &[u8]) -> Result<[u8; 4]> {
let suffix_len = suffix.len();
ensure!(
suffix_len >= 4,
FileSizeTooSmallSnafu {
size: suffix_len as u64
}
);
let mut bytes = [0; 4];
bytes.copy_from_slice(&suffix[suffix_len - 4..suffix_len]);
Ok(bytes)
}
fn validate_meta_size(&self, length: u64) -> Result<()> {
let max_meta_size = self.file_size - BLOOM_META_LEN_SIZE;
ensure!(
length <= max_meta_size,
UnexpectedMetaSizeSnafu {
max_meta_size,
actual_meta_size: length,
}
);
Ok(())
}
}
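// --- Illustration (not part of the crate): the footer arithmetic above with made-up numbers.
// Layout assumed by this reader: [ segments ... ][ meta JSON (`length` bytes) ][ length: u32 LE ].
fn footer_math_sketch() {
    let (file_size, prefetch_size, length) = (10_000u64, 1_024u64, 300u64);
    let meta_start = file_size.saturating_sub(prefetch_size); // 8_976: start of the prefetched suffix
    let suffix_len = file_size - meta_start; // 1_024 bytes prefetched from the tail
    // 300 <= 1_020, so the whole metadata already sits in the suffix: no second read needed.
    assert!(length <= suffix_len - 4);
    let json_start_in_suffix = file_size - length - 4 - meta_start; // 720
    assert_eq!(json_start_in_suffix..suffix_len - 4, 720..1_020); // the slice holding the JSON
}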
#[cfg(test)]
mod tests {
use std::sync::atomic::AtomicUsize;
use std::sync::Arc;
use futures::io::Cursor;
use super::*;
use crate::bloom_filter::creator::BloomFilterCreator;
use crate::external_provider::MockExternalTempFileProvider;
async fn mock_bloom_filter_bytes() -> Vec<u8> {
let mut writer = Cursor::new(vec![]);
let mut creator = BloomFilterCreator::new(
2,
Box::new(MockExternalTempFileProvider::new()),
Arc::new(AtomicUsize::new(0)),
None,
);
creator
.push_row_elems(vec![b"a".to_vec(), b"b".to_vec()])
.await
.unwrap();
creator
.push_row_elems(vec![b"c".to_vec(), b"d".to_vec()])
.await
.unwrap();
creator
.push_row_elems(vec![b"e".to_vec(), b"f".to_vec()])
.await
.unwrap();
creator.finish(&mut writer).await.unwrap();
writer.into_inner()
}
#[tokio::test]
async fn test_bloom_filter_meta_reader() {
let bytes = mock_bloom_filter_bytes().await;
let file_size = bytes.len() as u64;
for prefetch in [0u64, file_size / 2, file_size, file_size + 10] {
let mut reader =
BloomFilterMetaReader::new(bytes.clone(), file_size as _, Some(prefetch));
let meta = reader.metadata().await.unwrap();
assert_eq!(meta.rows_per_segment, 2);
assert_eq!(meta.seg_count, 2);
assert_eq!(meta.row_count, 3);
assert_eq!(meta.bloom_filter_segments.len(), 2);
assert_eq!(meta.bloom_filter_segments[0].offset, 0);
assert_eq!(meta.bloom_filter_segments[0].elem_count, 4);
assert_eq!(
meta.bloom_filter_segments[1].offset,
meta.bloom_filter_segments[0].size
);
assert_eq!(meta.bloom_filter_segments[1].elem_count, 2);
}
}
#[tokio::test]
async fn test_bloom_filter_reader() {
let bytes = mock_bloom_filter_bytes().await;
let mut reader = BloomFilterReaderImpl::new(bytes);
let meta = reader.metadata().await.unwrap();
assert_eq!(meta.bloom_filter_segments.len(), 2);
let bf = reader
.bloom_filter(&meta.bloom_filter_segments[0])
.await
.unwrap();
assert!(bf.contains(&b"a"));
assert!(bf.contains(&b"b"));
assert!(bf.contains(&b"c"));
assert!(bf.contains(&b"d"));
let bf = reader
.bloom_filter(&meta.bloom_filter_segments[1])
.await
.unwrap();
assert!(bf.contains(&b"e"));
assert!(bf.contains(&b"f"));
}
}

src/index/src/error.rs (new file, 48 lines)
View File

@@ -0,0 +1,48 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
use std::any::Any;
use common_error::ext::{BoxedError, ErrorExt};
use common_error::status_code::StatusCode;
use common_macro::stack_trace_debug;
use snafu::{Location, Snafu};
#[derive(Snafu)]
#[snafu(visibility(pub))]
#[stack_trace_debug]
pub enum Error {
#[snafu(display("External error"))]
External {
source: BoxedError,
#[snafu(implicit)]
location: Location,
},
}
impl ErrorExt for Error {
fn status_code(&self) -> StatusCode {
use Error::*;
match self {
External { source, .. } => source.status_code(),
}
}
fn as_any(&self) -> &dyn Any {
self
}
}
pub type Result<T> = std::result::Result<T, Error>;

View File

@@ -15,25 +15,24 @@
use async_trait::async_trait;
use futures::{AsyncRead, AsyncWrite};
use crate::inverted_index::error::Result;
use crate::error::Error;
/// Trait for managing intermediate files during external sorting for a particular index.
pub type Writer = Box<dyn AsyncWrite + Unpin + Send>;
pub type Reader = Box<dyn AsyncRead + Unpin + Send>;
/// Trait for managing intermediate files to control memory usage for a particular index.
#[mockall::automock]
#[async_trait]
pub trait ExternalTempFileProvider: Send + Sync {
/// Creates and opens a new intermediate file associated with a specific index for writing.
/// Creates and opens a new intermediate file associated with a specific `file_group` for writing.
/// The implementation should ensure that the file does not already exist.
///
/// - `index_name`: the name of the index for which the file will be associated
/// - `file_group`: a unique identifier for the group of files
/// - `file_id`: a unique identifier for the new file
async fn create(
&self,
index_name: &str,
file_id: &str,
) -> Result<Box<dyn AsyncWrite + Unpin + Send>>;
async fn create(&self, file_group: &str, file_id: &str) -> Result<Writer, Error>;
/// Retrieves all intermediate files associated with a specific index for an external sorting operation.
/// Retrieves all intermediate files and their associated file identifiers for a specific `file_group`.
///
/// `index_name`: the name of the index to retrieve intermediate files for
async fn read_all(&self, index_name: &str) -> Result<Vec<Box<dyn AsyncRead + Unpin + Send>>>;
/// `file_group` is a unique identifier for the group of files.
async fn read_all(&self, file_group: &str) -> Result<Vec<(String, Reader)>, Error>;
}
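// --- Illustration (not part of the crate): a minimal in-memory provider, mirroring how the
// mock is wired in the bloom filter tests. `create` hands out the write half of a duplex pipe
// and `read_all` drains the buffered read halves. Module paths, imports and the buffer size
// are assumptions made for this sketch.
use std::collections::HashMap;
use std::sync::Mutex;

use tokio::io::duplex;
use tokio_util::compat::{TokioAsyncReadCompatExt, TokioAsyncWriteCompatExt};

#[derive(Default)]
struct InMemoryProvider {
    files: Mutex<HashMap<String, Reader>>,
}

#[async_trait]
impl ExternalTempFileProvider for InMemoryProvider {
    async fn create(&self, _file_group: &str, file_id: &str) -> Result<Writer, Error> {
        let (writer, reader) = duplex(64 * 1024);
        self.files
            .lock()
            .unwrap()
            .insert(file_id.to_string(), Box::new(reader.compat()));
        Ok(Box::new(writer.compat_write()))
    }

    async fn read_all(&self, _file_group: &str) -> Result<Vec<(String, Reader)>, Error> {
        Ok(self.files.lock().unwrap().drain().collect())
    }
}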

View File

@@ -12,7 +12,6 @@
// See the License for the specific language governing permissions and
// limitations under the License.
pub mod external_provider;
pub mod external_sort;
mod intermediate_rw;
mod merge_stream;

View File

@@ -23,15 +23,16 @@ use async_trait::async_trait;
use common_base::BitVec;
use common_telemetry::{debug, error};
use futures::stream;
use snafu::ResultExt;
use crate::inverted_index::create::sort::external_provider::ExternalTempFileProvider;
use crate::external_provider::ExternalTempFileProvider;
use crate::inverted_index::create::sort::intermediate_rw::{
IntermediateReader, IntermediateWriter,
};
use crate::inverted_index::create::sort::merge_stream::MergeSortedStream;
use crate::inverted_index::create::sort::{SortOutput, SortedStream, Sorter};
use crate::inverted_index::create::sort_create::SorterFactory;
use crate::inverted_index::error::Result;
use crate::inverted_index::error::{IntermediateSnafu, Result};
use crate::inverted_index::{Bytes, BytesRef};
/// `ExternalSorter` manages the sorting of data using both in-memory structures and external files.
@@ -107,7 +108,11 @@ impl Sorter for ExternalSorter {
/// Finalizes the sorting operation, merging data from both in-memory buffer and external files
/// into a sorted stream
async fn output(&mut self) -> Result<SortOutput> {
let readers = self.temp_file_provider.read_all(&self.index_name).await?;
let readers = self
.temp_file_provider
.read_all(&self.index_name)
.await
.context(IntermediateSnafu)?;
// TODO(zhongzc): k-way merge instead of 2-way merge
@@ -122,7 +127,7 @@ impl Sorter for ExternalSorter {
Ok((value, bitmap))
}),
)));
for reader in readers {
for (_, reader) in readers {
tree_nodes.push_back(IntermediateReader::new(reader).into_stream().await?);
}
@@ -241,7 +246,11 @@ impl ExternalSorter {
let file_id = &format!("{:012}", self.total_row_count);
let index_name = &self.index_name;
let writer = self.temp_file_provider.create(index_name, file_id).await?;
let writer = self
.temp_file_provider
.create(index_name, file_id)
.await
.context(IntermediateSnafu)?;
let values = mem::take(&mut self.values_buffer);
self.global_memory_usage
@@ -302,7 +311,7 @@ mod tests {
use tokio_util::compat::{TokioAsyncReadCompatExt, TokioAsyncWriteCompatExt};
use super::*;
use crate::inverted_index::create::sort::external_provider::MockExternalTempFileProvider;
use crate::external_provider::MockExternalTempFileProvider;
async fn test_external_sorter(
current_memory_usage_threshold: Option<usize>,
@@ -332,7 +341,7 @@ mod tests {
move |index_name| {
assert_eq!(index_name, "test");
let mut files = files.lock().unwrap();
Ok(files.drain().map(|f| f.1).collect::<Vec<_>>())
Ok(files.drain().collect::<Vec<_>>())
}
});

View File

@@ -213,6 +213,13 @@ pub enum Error {
#[snafu(implicit)]
location: Location,
},
#[snafu(display("Intermediate error"))]
Intermediate {
source: crate::error::Error,
#[snafu(implicit)]
location: Location,
},
}
impl ErrorExt for Error {
@@ -245,6 +252,7 @@ impl ErrorExt for Error {
| InconsistentRowCount { .. }
| IndexNotFound { .. } => StatusCode::InvalidArguments,
Intermediate { source, .. } => source.status_code(),
External { source, .. } => source.status_code(),
}
}

View File

@@ -31,12 +31,21 @@ mod footer;
/// InvertedIndexReader defines an asynchronous reader of inverted index data
#[mockall::automock]
#[async_trait]
pub trait InvertedIndexReader: Send {
pub trait InvertedIndexReader: Send + Sync {
/// Seeks to given offset and reads data with exact size as provided.
async fn range_read(&mut self, offset: u64, size: u32) -> Result<Vec<u8>>;
/// Reads the bytes in the given ranges.
async fn read_vec(&mut self, ranges: &[Range<u64>]) -> Result<Vec<Bytes>>;
async fn read_vec(&mut self, ranges: &[Range<u64>]) -> Result<Vec<Bytes>> {
let mut result = Vec::with_capacity(ranges.len());
for range in ranges {
let data = self
.range_read(range.start, (range.end - range.start) as u32)
.await?;
result.push(Bytes::from(data));
}
Ok(result)
}
/// Retrieves metadata of all inverted indices stored within the blob.
async fn metadata(&mut self) -> Result<Arc<InvertedIndexMetas>>;

View File

@@ -51,7 +51,7 @@ impl<R> InvertedIndexBlobReader<R> {
}
#[async_trait]
impl<R: RangeReader> InvertedIndexReader for InvertedIndexBlobReader<R> {
impl<R: RangeReader + Sync> InvertedIndexReader for InvertedIndexBlobReader<R> {
async fn range_read(&mut self, offset: u64, size: u32) -> Result<Vec<u8>> {
let buf = self
.source

View File

@@ -16,5 +16,7 @@
#![feature(assert_matches)]
pub mod bloom_filter;
pub mod error;
pub mod external_provider;
pub mod fulltext_index;
pub mod inverted_index;

View File

@@ -11,5 +11,6 @@ workspace = true
chrono.workspace = true
common-error.workspace = true
common-macro.workspace = true
serde.workspace = true
snafu.workspace = true
table.workspace = true

View File

@@ -15,6 +15,7 @@
use std::any::Any;
use common_error::ext::ErrorExt;
use common_error::status_code::StatusCode;
use common_macro::stack_trace_debug;
use snafu::Snafu;
@@ -41,6 +42,15 @@ impl ErrorExt for Error {
fn as_any(&self) -> &dyn Any {
self
}
fn status_code(&self) -> StatusCode {
match self {
Error::InvalidTimeFilter { .. }
| Error::InvalidDateFormat { .. }
| Error::InvalidSpanFormat { .. }
| Error::EndBeforeStart { .. } => StatusCode::InvalidArguments,
}
}
}
pub type Result<T> = std::result::Result<T, Error>;

View File

@@ -13,6 +13,7 @@
// limitations under the License.
use chrono::{DateTime, Datelike, Duration, NaiveDate, NaiveTime, TimeZone, Utc};
use serde::{Deserialize, Serialize};
use table::table_name::TableName;
use crate::error::{
@@ -21,9 +22,10 @@ use crate::error::{
};
/// GreptimeDB's log query request.
#[derive(Debug, Serialize, Deserialize)]
pub struct LogQuery {
/// A fully qualified table name to query logs from.
pub table_name: TableName,
pub table: TableName,
/// Specifies the time range for the log query. See [`TimeFilter`] for more details.
pub time_filter: TimeFilter,
/// Columns with filters to query.
@@ -34,6 +36,18 @@ pub struct LogQuery {
pub context: Context,
}
impl Default for LogQuery {
fn default() -> Self {
Self {
table: TableName::new("", "", ""),
time_filter: Default::default(),
columns: vec![],
limit: None,
context: Default::default(),
}
}
}
/// Represents a time range for log query.
///
/// This struct allows various formats to express a time range from the user side
@@ -58,7 +72,7 @@ pub struct LogQuery {
///
/// This struct doesn't require a timezone to be present. When the timezone is not
/// provided, the default timezone is filled in, following the same rules as other queries.
#[derive(Debug, Clone)]
#[derive(Debug, Clone, Default, Serialize, Deserialize)]
pub struct TimeFilter {
pub start: Option<String>,
pub end: Option<String>,
@@ -69,8 +83,7 @@ impl TimeFilter {
/// Validate and canonicalize the time filter.
///
/// This function will try to fill the missing fields and convert all dates to timestamps
// false positive
#[allow(unused_assignments)]
#[allow(unused_assignments)] // false positive
pub fn canonicalize(&mut self) -> Result<()> {
let mut start_dt = None;
let mut end_dt = None;
@@ -209,6 +222,7 @@ impl TimeFilter {
}
/// Represents a column with filters to query.
#[derive(Debug, Serialize, Deserialize)]
pub struct ColumnFilters {
/// Case-sensitive column name to query.
pub column_name: String,
@@ -216,6 +230,7 @@ pub struct ColumnFilters {
pub filters: Vec<ContentFilter>,
}
#[derive(Debug, Serialize, Deserialize)]
pub enum ContentFilter {
/// Only match the exact content.
///
@@ -234,13 +249,16 @@ pub enum ContentFilter {
Compound(Vec<ContentFilter>, BinaryOperator),
}
#[derive(Debug, Serialize, Deserialize)]
pub enum BinaryOperator {
And,
Or,
}
/// Controls how many adjacent lines to return.
#[derive(Debug, Default, Serialize, Deserialize)]
pub enum Context {
#[default]
None,
/// Specify the number of lines before and after the matched line separately.
Lines(usize, usize),
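// --- Illustration (not part of this diff): constructing a request with the fields shown above.
// Values are made up; `TimeFilter` fields not visible here are filled via `Default`.
fn example_log_query() -> LogQuery {
    LogQuery {
        table: TableName::new("greptime", "public", "app_logs"),
        time_filter: TimeFilter {
            start: Some("2024-12-01T00:00:00Z".to_string()),
            end: Some("2024-12-02T00:00:00Z".to_string()),
            ..Default::default()
        },
        // No column filters: return whole rows within the time range.
        columns: vec![],
        limit: Some(100),
        // Also return two lines of context around each matched line.
        context: Context::Lines(2, 2),
    }
}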

View File

@@ -25,6 +25,7 @@ use std::sync::Arc;
use api::v1::meta::{ProcedureDetailResponse, Role};
use cluster::Client as ClusterClient;
pub use cluster::ClusterKvBackend;
use common_error::ext::BoxedError;
use common_grpc::channel_manager::{ChannelConfig, ChannelManager};
use common_meta::cluster::{
@@ -33,6 +34,8 @@ use common_meta::cluster::{
use common_meta::datanode::{DatanodeStatKey, DatanodeStatValue, RegionStat};
use common_meta::ddl::{ExecutorContext, ProcedureExecutor};
use common_meta::error::{self as meta_error, ExternalSnafu, Result as MetaResult};
use common_meta::key::flow::flow_state::{FlowStat, FlowStateManager};
use common_meta::kv_backend::KvBackendRef;
use common_meta::range_stream::PaginationStream;
use common_meta::rpc::ddl::{SubmitDdlTaskRequest, SubmitDdlTaskResponse};
use common_meta::rpc::procedure::{
@@ -54,7 +57,8 @@ use store::Client as StoreClient;
pub use self::heartbeat::{HeartbeatSender, HeartbeatStream};
use crate::error::{
ConvertMetaRequestSnafu, ConvertMetaResponseSnafu, Error, NotStartedSnafu, Result,
ConvertMetaRequestSnafu, ConvertMetaResponseSnafu, Error, GetFlowStatSnafu, NotStartedSnafu,
Result,
};
pub type Id = (u64, u64);
@@ -322,8 +326,8 @@ impl ClusterInfo for MetaClient {
let cluster_kv_backend = Arc::new(self.cluster_client()?);
let range_prefix = DatanodeStatKey::key_prefix_with_cluster_id(self.id.0);
let req = RangeRequest::new().with_prefix(range_prefix);
let stream = PaginationStream::new(cluster_kv_backend, req, 256, Arc::new(decode_stats))
.into_stream();
let stream =
PaginationStream::new(cluster_kv_backend, req, 256, decode_stats).into_stream();
let mut datanode_stats = stream
.try_collect::<Vec<_>>()
.await
@@ -347,6 +351,15 @@ fn decode_stats(kv: KeyValue) -> MetaResult<DatanodeStatValue> {
}
impl MetaClient {
pub async fn list_flow_stats(&self) -> Result<Option<FlowStat>> {
let cluster_backend = ClusterKvBackend::new(Arc::new(self.cluster_client()?));
let cluster_backend = Arc::new(cluster_backend) as KvBackendRef;
let flow_state_manager = FlowStateManager::new(cluster_backend);
let res = flow_state_manager.get().await.context(GetFlowStatSnafu)?;
Ok(res.map(|r| r.into()))
}
pub fn new(id: Id) -> Self {
Self {
id,
@@ -981,8 +994,7 @@ mod tests {
let req = RangeRequest::new().with_prefix(b"__prefix/");
let stream =
PaginationStream::new(Arc::new(cluster_client), req, 10, Arc::new(mock_decoder))
.into_stream();
PaginationStream::new(Arc::new(cluster_client), req, 10, mock_decoder).into_stream();
let res = stream.try_collect::<Vec<_>>().await.unwrap();
assert_eq!(10, res.len());

View File

@@ -40,8 +40,8 @@ use tonic::Status;
use crate::client::ask_leader::AskLeader;
use crate::client::{util, Id};
use crate::error::{
ConvertMetaResponseSnafu, CreateChannelSnafu, Error, IllegalGrpcClientStateSnafu, Result,
RetryTimesExceededSnafu,
ConvertMetaResponseSnafu, CreateChannelSnafu, Error, IllegalGrpcClientStateSnafu,
ReadOnlyKvBackendSnafu, Result, RetryTimesExceededSnafu,
};
#[derive(Clone, Debug)]
@@ -308,3 +308,75 @@ impl Inner {
.map(|res| (res.leader, res.followers))
}
}
/// A read-only client for cluster info, corresponding to the `in_memory`
/// kv backend in the meta-srv.
#[derive(Clone, Debug)]
pub struct ClusterKvBackend {
inner: Arc<Client>,
}
impl ClusterKvBackend {
pub fn new(client: Arc<Client>) -> Self {
Self { inner: client }
}
fn unimpl(&self) -> common_meta::error::Error {
let ret: common_meta::error::Result<()> = ReadOnlyKvBackendSnafu {
name: self.name().to_string(),
}
.fail()
.map_err(BoxedError::new)
.context(common_meta::error::ExternalSnafu);
ret.unwrap_err()
}
}
impl TxnService for ClusterKvBackend {
type Error = common_meta::error::Error;
}
#[async_trait::async_trait]
impl KvBackend for ClusterKvBackend {
fn name(&self) -> &str {
"ClusterKvBackend"
}
fn as_any(&self) -> &dyn Any {
self
}
async fn range(&self, req: RangeRequest) -> common_meta::error::Result<RangeResponse> {
self.inner
.range(req)
.await
.map_err(BoxedError::new)
.context(common_meta::error::ExternalSnafu)
}
async fn batch_get(&self, _: BatchGetRequest) -> common_meta::error::Result<BatchGetResponse> {
Err(self.unimpl())
}
async fn put(&self, _: PutRequest) -> common_meta::error::Result<PutResponse> {
Err(self.unimpl())
}
async fn batch_put(&self, _: BatchPutRequest) -> common_meta::error::Result<BatchPutResponse> {
Err(self.unimpl())
}
async fn delete_range(
&self,
_: DeleteRangeRequest,
) -> common_meta::error::Result<DeleteRangeResponse> {
Err(self.unimpl())
}
async fn batch_delete(
&self,
_: BatchDeleteRequest,
) -> common_meta::error::Result<BatchDeleteResponse> {
Err(self.unimpl())
}
}

View File

@@ -99,8 +99,22 @@ pub enum Error {
source: common_meta::error::Error,
},
#[snafu(display("Failed to get flow stat"))]
GetFlowStat {
#[snafu(implicit)]
location: Location,
source: common_meta::error::Error,
},
#[snafu(display("Retry exceeded max times({}), message: {}", times, msg))]
RetryTimesExceeded { times: usize, msg: String },
#[snafu(display("Trying to write to a read-only kv backend: {}", name))]
ReadOnlyKvBackend {
name: String,
#[snafu(implicit)]
location: Location,
},
}
#[allow(dead_code)]
@@ -120,13 +134,15 @@ impl ErrorExt for Error {
| Error::SendHeartbeat { .. }
| Error::CreateHeartbeatStream { .. }
| Error::CreateChannel { .. }
| Error::RetryTimesExceeded { .. } => StatusCode::Internal,
| Error::RetryTimesExceeded { .. }
| Error::ReadOnlyKvBackend { .. } => StatusCode::Internal,
Error::MetaServer { code, .. } => *code,
Error::InvalidResponseHeader { source, .. }
| Error::ConvertMetaRequest { source, .. }
| Error::ConvertMetaResponse { source, .. } => source.status_code(),
| Error::ConvertMetaResponse { source, .. }
| Error::GetFlowStat { source, .. } => source.status_code(),
}
}
}

View File

@@ -716,6 +716,13 @@ pub enum Error {
#[snafu(implicit)]
location: Location,
},
#[snafu(display("Flow state handler error"))]
FlowStateHandler {
#[snafu(implicit)]
location: Location,
source: common_meta::error::Error,
},
}
impl Error {
@@ -761,7 +768,8 @@ impl ErrorExt for Error {
| Error::Join { .. }
| Error::PeerUnavailable { .. }
| Error::ExceededDeadline { .. }
| Error::ChooseItems { .. } => StatusCode::Internal,
| Error::ChooseItems { .. }
| Error::FlowStateHandler { .. } => StatusCode::Internal,
Error::Unsupported { .. } => StatusCode::Unsupported,

View File

@@ -51,6 +51,7 @@ use tokio::sync::mpsc::Sender;
use tokio::sync::{oneshot, Notify, RwLock};
use crate::error::{self, DeserializeFromJsonSnafu, Result, UnexpectedInstructionReplySnafu};
use crate::handler::flow_state_handler::FlowStateHandler;
use crate::metasrv::Context;
use crate::metrics::{METRIC_META_HANDLER_EXECUTE, METRIC_META_HEARTBEAT_CONNECTION_NUM};
use crate::pubsub::PublisherRef;
@@ -64,6 +65,7 @@ pub mod collect_stats_handler;
pub mod extract_stat_handler;
pub mod failure_handler;
pub mod filter_inactive_region_stats;
pub mod flow_state_handler;
pub mod keep_lease_handler;
pub mod mailbox_handler;
pub mod on_leader_start_handler;
@@ -482,6 +484,8 @@ pub struct HeartbeatHandlerGroupBuilder {
/// based on the number of received heartbeats. When the number of heartbeats
/// reaches this factor, a flush operation is triggered.
flush_stats_factor: Option<usize>,
/// A simple handler for flow internal state reports
flow_state_handler: Option<FlowStateHandler>,
/// The plugins.
plugins: Option<Plugins>,
@@ -499,12 +503,18 @@ impl HeartbeatHandlerGroupBuilder {
region_failure_handler: None,
region_lease_handler: None,
flush_stats_factor: None,
flow_state_handler: None,
plugins: None,
pushers,
handlers: vec![],
}
}
pub fn with_flow_state_handler(mut self, handler: Option<FlowStateHandler>) -> Self {
self.flow_state_handler = handler;
self
}
pub fn with_region_lease_handler(mut self, handler: Option<RegionLeaseHandler>) -> Self {
self.region_lease_handler = handler;
self
@@ -564,6 +574,10 @@ impl HeartbeatHandlerGroupBuilder {
}
self.add_handler_last(CollectStatsHandler::new(self.flush_stats_factor));
if let Some(flow_state_handler) = self.flow_state_handler.take() {
self.add_handler_last(flow_state_handler);
}
self
}

View File

@@ -0,0 +1,58 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
use api::v1::meta::{FlowStat, HeartbeatRequest, Role};
use common_meta::key::flow::flow_state::{FlowStateManager, FlowStateValue};
use snafu::ResultExt;
use crate::error::{FlowStateHandlerSnafu, Result};
use crate::handler::{HandleControl, HeartbeatAccumulator, HeartbeatHandler};
use crate::metasrv::Context;
pub struct FlowStateHandler {
flow_state_manager: FlowStateManager,
}
impl FlowStateHandler {
pub fn new(flow_state_manager: FlowStateManager) -> Self {
Self { flow_state_manager }
}
}
#[async_trait::async_trait]
impl HeartbeatHandler for FlowStateHandler {
fn is_acceptable(&self, role: Role) -> bool {
role == Role::Flownode
}
async fn handle(
&self,
req: &HeartbeatRequest,
_ctx: &mut Context,
_acc: &mut HeartbeatAccumulator,
) -> Result<HandleControl> {
if let Some(FlowStat { flow_stat_size }) = &req.flow_stat {
let state_size = flow_stat_size
.iter()
.map(|(k, v)| (*k, *v as usize))
.collect();
let value = FlowStateValue::new(state_size);
self.flow_state_manager
.put(value)
.await
.context(FlowStateHandlerSnafu)?;
}
Ok(HandleControl::Continue)
}
}

View File

@@ -26,6 +26,7 @@ use common_meta::ddl::{
};
use common_meta::ddl_manager::DdlManager;
use common_meta::distributed_time_constants;
use common_meta::key::flow::flow_state::FlowStateManager;
use common_meta::key::flow::FlowMetadataManager;
use common_meta::key::maintenance::MaintenanceModeManager;
use common_meta::key::TableMetadataManager;
@@ -47,6 +48,7 @@ use crate::error::{self, Result};
use crate::flow_meta_alloc::FlowPeerAllocator;
use crate::greptimedb_telemetry::get_greptimedb_telemetry_task;
use crate::handler::failure_handler::RegionFailureHandler;
use crate::handler::flow_state_handler::FlowStateHandler;
use crate::handler::region_lease_handler::RegionLeaseHandler;
use crate::handler::{HeartbeatHandlerGroupBuilder, HeartbeatMailbox, Pushers};
use crate::lease::MetaPeerLookupService;
@@ -228,6 +230,7 @@ impl MetasrvBuilder {
peer_allocator,
))
});
let flow_metadata_allocator = {
// for now flownode just use round-robin selector
let flow_selector = RoundRobinSelector::new(SelectTarget::Flownode);
@@ -248,6 +251,9 @@ impl MetasrvBuilder {
peer_allocator,
))
};
let flow_state_handler =
FlowStateHandler::new(FlowStateManager::new(in_memory.clone().as_kv_backend_ref()));
let memory_region_keeper = Arc::new(MemoryRegionKeeper::default());
let node_manager = node_manager.unwrap_or_else(|| {
let datanode_client_channel_config = ChannelConfig::new()
@@ -350,6 +356,7 @@ impl MetasrvBuilder {
.with_region_failure_handler(region_failover_handler)
.with_region_lease_handler(Some(region_lease_handler))
.with_flush_stats_factor(Some(options.flush_stats_factor))
.with_flow_state_handler(Some(flow_state_handler))
.add_default_handlers()
}
};

View File

@@ -102,7 +102,7 @@ impl LeaderCachedKvBackend {
self.store.clone(),
RangeRequest::new().with_prefix(prefix.as_bytes()),
DEFAULT_PAGE_SIZE,
Arc::new(Ok),
Ok,
)
.into_stream();
@@ -386,6 +386,10 @@ impl ResettableKvBackend for LeaderCachedKvBackend {
fn reset(&self) {
self.cache.reset()
}
fn as_kv_backend_ref(self: Arc<Self>) -> KvBackendRef<Self::Error> {
self
}
}
#[cfg(test)]

View File

@@ -37,7 +37,7 @@ use store_api::storage::{ConcreteDataType, RegionId, TimeSeriesRowSelector};
use crate::cache::cache_size::parquet_meta_size;
use crate::cache::file_cache::{FileType, IndexKey};
use crate::cache::index::{InvertedIndexCache, InvertedIndexCacheRef};
use crate::cache::index::inverted_index::{InvertedIndexCache, InvertedIndexCacheRef};
use crate::cache::write_cache::WriteCacheRef;
use crate::metrics::{CACHE_BYTES, CACHE_EVICTION, CACHE_HIT, CACHE_MISS};
use crate::read::Batch;

View File

@@ -12,168 +12,29 @@
// See the License for the specific language governing permissions and
// limitations under the License.
pub mod inverted_index;
use std::future::Future;
use std::hash::Hash;
use std::ops::Range;
use std::sync::Arc;
use api::v1::index::InvertedIndexMetas;
use async_trait::async_trait;
use bytes::Bytes;
use common_base::BitVec;
use index::inverted_index::error::DecodeFstSnafu;
use index::inverted_index::format::reader::InvertedIndexReader;
use index::inverted_index::FstMap;
use object_store::Buffer;
use prost::Message;
use snafu::ResultExt;
use crate::metrics::{CACHE_BYTES, CACHE_HIT, CACHE_MISS};
use crate::sst::file::FileId;
/// Metrics for index metadata.
const INDEX_METADATA_TYPE: &str = "index_metadata";
/// Metrics for index content.
const INDEX_CONTENT_TYPE: &str = "index_content";
/// Inverted index blob reader with cache.
pub struct CachedInvertedIndexBlobReader<R> {
file_id: FileId,
file_size: u64,
inner: R,
cache: InvertedIndexCacheRef,
}
impl<R> CachedInvertedIndexBlobReader<R> {
pub fn new(file_id: FileId, file_size: u64, inner: R, cache: InvertedIndexCacheRef) -> Self {
Self {
file_id,
file_size,
inner,
cache,
}
}
}
impl<R> CachedInvertedIndexBlobReader<R>
where
R: InvertedIndexReader,
{
/// Gets given range of index data from cache, and loads from source if the file
/// is not already cached.
async fn get_or_load(
&mut self,
offset: u64,
size: u32,
) -> index::inverted_index::error::Result<Vec<u8>> {
let keys =
IndexDataPageKey::generate_page_keys(self.file_id, offset, size, self.cache.page_size);
// Size is 0, return empty data.
if keys.is_empty() {
return Ok(Vec::new());
}
let mut data = Vec::with_capacity(keys.len());
data.resize(keys.len(), Bytes::new());
let mut cache_miss_range = vec![];
let mut cache_miss_idx = vec![];
let last_index = keys.len() - 1;
// TODO: Avoid copy as much as possible.
for (i, index) in keys.iter().enumerate() {
match self.cache.get_index(index) {
Some(page) => {
CACHE_HIT.with_label_values(&[INDEX_CONTENT_TYPE]).inc();
data[i] = page;
}
None => {
CACHE_MISS.with_label_values(&[INDEX_CONTENT_TYPE]).inc();
let base_offset = index.page_id * self.cache.page_size;
let pruned_size = if i == last_index {
prune_size(&keys, self.file_size, self.cache.page_size)
} else {
self.cache.page_size
};
cache_miss_range.push(base_offset..base_offset + pruned_size);
cache_miss_idx.push(i);
}
}
}
if !cache_miss_range.is_empty() {
let pages = self.inner.read_vec(&cache_miss_range).await?;
for (i, page) in cache_miss_idx.into_iter().zip(pages.into_iter()) {
let key = keys[i].clone();
data[i] = page.clone();
self.cache.put_index(key, page.clone());
}
}
let buffer = Buffer::from_iter(data.into_iter());
Ok(buffer
.slice(IndexDataPageKey::calculate_range(
offset,
size,
self.cache.page_size,
))
.to_vec())
}
}
#[async_trait]
impl<R: InvertedIndexReader> InvertedIndexReader for CachedInvertedIndexBlobReader<R> {
async fn range_read(
&mut self,
offset: u64,
size: u32,
) -> index::inverted_index::error::Result<Vec<u8>> {
self.inner.range_read(offset, size).await
}
async fn read_vec(
&mut self,
ranges: &[Range<u64>],
) -> index::inverted_index::error::Result<Vec<Bytes>> {
self.inner.read_vec(ranges).await
}
async fn metadata(&mut self) -> index::inverted_index::error::Result<Arc<InvertedIndexMetas>> {
if let Some(cached) = self.cache.get_index_metadata(self.file_id) {
CACHE_HIT.with_label_values(&[INDEX_METADATA_TYPE]).inc();
Ok(cached)
} else {
let meta = self.inner.metadata().await?;
self.cache.put_index_metadata(self.file_id, meta.clone());
CACHE_MISS.with_label_values(&[INDEX_METADATA_TYPE]).inc();
Ok(meta)
}
}
async fn fst(
&mut self,
offset: u64,
size: u32,
) -> index::inverted_index::error::Result<FstMap> {
self.get_or_load(offset, size)
.await
.and_then(|r| FstMap::new(r).context(DecodeFstSnafu))
}
async fn bitmap(
&mut self,
offset: u64,
size: u32,
) -> index::inverted_index::error::Result<BitVec> {
self.get_or_load(offset, size).await.map(BitVec::from_vec)
}
}
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct IndexMetadataKey {
file_id: FileId,
}
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct IndexDataPageKey {
file_id: FileId,
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct PageKey {
page_id: u64,
}
impl IndexDataPageKey {
impl PageKey {
/// Converts an offset to a page ID based on the page size.
fn calculate_page_id(offset: u64, page_size: u64) -> u64 {
offset / page_size
@@ -199,49 +60,60 @@ impl IndexDataPageKey {
start..end
}
/// Generates a vector of IndexKey instances for the pages that a given offset and size span.
fn generate_page_keys(file_id: FileId, offset: u64, size: u32, page_size: u64) -> Vec<Self> {
/// Generates an iterator of `PageKey`s for the pages that a given offset and size span.
fn generate_page_keys(offset: u64, size: u32, page_size: u64) -> impl Iterator<Item = Self> {
let start_page = Self::calculate_page_id(offset, page_size);
let total_pages = Self::calculate_page_count(offset, size, page_size);
(0..total_pages)
.map(|i| Self {
file_id,
page_id: start_page + i as u64,
})
.collect()
(0..total_pages).map(move |i| Self {
page_id: start_page + i as u64,
})
}
}
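// --- Illustration (not part of the crate): the page arithmetic above with made-up numbers.
// Only the offset / page-size math is reproduced; the page-count helper itself is not shown
// in this hunk.
fn page_math_example() {
    let (offset, size, page_size) = (5_000u64, 3_000u64, 2_048u64);
    let first_page = offset / page_size; // calculate_page_id -> 2
    let last_page = (offset + size - 1) / page_size; // 3, so PageKeys with page_id 2 and 3
    assert_eq!((first_page, last_page), (2, 3));
    // get_or_load then fetches whole pages at 2 * 2_048 and 3 * 2_048 and slices the
    // requested bytes back out of the concatenated buffer:
    let in_buffer_start = offset - first_page * page_size; // 904
    assert_eq!(in_buffer_start..in_buffer_start + size, 904..3_904);
}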
pub type InvertedIndexCacheRef = Arc<InvertedIndexCache>;
pub struct InvertedIndexCache {
/// Cache for inverted index metadata
index_metadata: moka::sync::Cache<IndexMetadataKey, Arc<InvertedIndexMetas>>,
/// Cache for inverted index content.
index: moka::sync::Cache<IndexDataPageKey, Bytes>,
/// Cache for index metadata and content.
pub struct IndexCache<K, M> {
/// Cache for index metadata
index_metadata: moka::sync::Cache<K, Arc<M>>,
/// Cache for index content.
index: moka::sync::Cache<(K, PageKey), Bytes>,
// Page size for index content.
page_size: u64,
/// Weighter for metadata.
weight_of_metadata: fn(&K, &Arc<M>) -> u32,
/// Weighter for content.
weight_of_content: fn(&(K, PageKey), &Bytes) -> u32,
}
impl InvertedIndexCache {
/// Creates `InvertedIndexCache` with provided `index_metadata_cap` and `index_content_cap`.
pub fn new(index_metadata_cap: u64, index_content_cap: u64, page_size: u64) -> Self {
common_telemetry::debug!("Building InvertedIndexCache with metadata size: {index_metadata_cap}, content size: {index_content_cap}");
impl<K, M> IndexCache<K, M>
where
K: Hash + Eq + Send + Sync + 'static,
M: Send + Sync + 'static,
{
pub fn new_with_weighter(
index_metadata_cap: u64,
index_content_cap: u64,
page_size: u64,
index_type: &'static str,
weight_of_metadata: fn(&K, &Arc<M>) -> u32,
weight_of_content: fn(&(K, PageKey), &Bytes) -> u32,
) -> Self {
common_telemetry::debug!("Building IndexCache with metadata size: {index_metadata_cap}, content size: {index_content_cap}, page size: {page_size}, index type: {index_type}");
let index_metadata = moka::sync::CacheBuilder::new(index_metadata_cap)
.name("inverted_index_metadata")
.weigher(index_metadata_weight)
.eviction_listener(|k, v, _cause| {
let size = index_metadata_weight(&k, &v);
.name(&format!("index_metadata_{}", index_type))
.weigher(weight_of_metadata)
.eviction_listener(move |k, v, _cause| {
let size = weight_of_metadata(&k, &v);
CACHE_BYTES
.with_label_values(&[INDEX_METADATA_TYPE])
.sub(size.into());
})
.build();
let index_cache = moka::sync::CacheBuilder::new(index_content_cap)
.name("inverted_index_content")
.weigher(index_content_weight)
.eviction_listener(|k, v, _cause| {
let size = index_content_weight(&k, &v);
.name(&format!("index_content_{}", index_type))
.weigher(weight_of_content)
.eviction_listener(move |k, v, _cause| {
let size = weight_of_content(&k, &v);
CACHE_BYTES
.with_label_values(&[INDEX_CONTENT_TYPE])
.sub(size.into());
@@ -251,259 +123,109 @@ impl InvertedIndexCache {
index_metadata,
index: index_cache,
page_size,
weight_of_content,
weight_of_metadata,
}
}
}
impl InvertedIndexCache {
pub fn get_index_metadata(&self, file_id: FileId) -> Option<Arc<InvertedIndexMetas>> {
self.index_metadata.get(&IndexMetadataKey { file_id })
impl<K, M> IndexCache<K, M>
where
K: Hash + Eq + Clone + Copy + Send + Sync + 'static,
M: Send + Sync + 'static,
{
pub fn get_metadata(&self, key: K) -> Option<Arc<M>> {
self.index_metadata.get(&key)
}
pub fn put_index_metadata(&self, file_id: FileId, metadata: Arc<InvertedIndexMetas>) {
let key = IndexMetadataKey { file_id };
pub fn put_metadata(&self, key: K, metadata: Arc<M>) {
CACHE_BYTES
.with_label_values(&[INDEX_METADATA_TYPE])
.add(index_metadata_weight(&key, &metadata).into());
.add((self.weight_of_metadata)(&key, &metadata).into());
self.index_metadata.insert(key, metadata)
}
pub fn get_index(&self, key: &IndexDataPageKey) -> Option<Bytes> {
self.index.get(key)
/// Gets the given range of index data from the cache, loading it from the source if it
/// is not already cached.
async fn get_or_load<F, Fut, E>(
&self,
key: K,
file_size: u64,
offset: u64,
size: u32,
load: F,
) -> Result<Vec<u8>, E>
where
F: FnOnce(Vec<Range<u64>>) -> Fut,
Fut: Future<Output = Result<Vec<Bytes>, E>>,
E: std::error::Error,
{
let page_keys =
PageKey::generate_page_keys(offset, size, self.page_size).collect::<Vec<_>>();
// Size is 0, return empty data.
if page_keys.is_empty() {
return Ok(Vec::new());
}
let mut data = Vec::with_capacity(page_keys.len());
data.resize(page_keys.len(), Bytes::new());
let mut cache_miss_range = vec![];
let mut cache_miss_idx = vec![];
let last_index = page_keys.len() - 1;
// TODO: Avoid copy as much as possible.
for (i, page_key) in page_keys.iter().enumerate() {
match self.get_page(key, *page_key) {
Some(page) => {
CACHE_HIT.with_label_values(&[INDEX_CONTENT_TYPE]).inc();
data[i] = page;
}
None => {
CACHE_MISS.with_label_values(&[INDEX_CONTENT_TYPE]).inc();
let base_offset = page_key.page_id * self.page_size;
let pruned_size = if i == last_index {
prune_size(page_keys.iter(), file_size, self.page_size)
} else {
self.page_size
};
cache_miss_range.push(base_offset..base_offset + pruned_size);
cache_miss_idx.push(i);
}
}
}
if !cache_miss_range.is_empty() {
let pages = load(cache_miss_range).await?;
for (i, page) in cache_miss_idx.into_iter().zip(pages.into_iter()) {
let page_key = page_keys[i];
data[i] = page.clone();
self.put_page(key, page_key, page.clone());
}
}
let buffer = Buffer::from_iter(data.into_iter());
Ok(buffer
.slice(PageKey::calculate_range(offset, size, self.page_size))
.to_vec())
}
pub fn put_index(&self, key: IndexDataPageKey, value: Bytes) {
fn get_page(&self, key: K, page_key: PageKey) -> Option<Bytes> {
self.index.get(&(key, page_key))
}
fn put_page(&self, key: K, page_key: PageKey, value: Bytes) {
CACHE_BYTES
.with_label_values(&[INDEX_CONTENT_TYPE])
.add(index_content_weight(&key, &value).into());
self.index.insert(key, value);
.add((self.weight_of_content)(&(key, page_key), &value).into());
self.index.insert((key, page_key), value);
}
}
/// Calculates weight for index metadata.
fn index_metadata_weight(k: &IndexMetadataKey, v: &Arc<InvertedIndexMetas>) -> u32 {
(k.file_id.as_bytes().len() + v.encoded_len()) as u32
}
/// Calculates weight for index content.
fn index_content_weight(k: &IndexDataPageKey, v: &Bytes) -> u32 {
(k.file_id.as_bytes().len() + v.len()) as u32
}
/// Prunes the size of the last page based on the indexes.
/// We have the following cases:
/// 1. If the remaining file size is less than the page size, read to the end of the file.
/// 2. Otherwise, read a full page.
fn prune_size(indexes: &[IndexDataPageKey], file_size: u64, page_size: u64) -> u64 {
fn prune_size<'a>(
indexes: impl Iterator<Item = &'a PageKey>,
file_size: u64,
page_size: u64,
) -> u64 {
let last_page_start = indexes.last().map(|i| i.page_id * page_size).unwrap_or(0);
page_size.min(file_size - last_page_start)
}
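// For example, with page_size = 2_048, file_size = 7_000 and a final page_id of 3,
// last_page_start = 6_144, so the pruned size is min(2_048, 7_000 - 6_144) = 856 bytes and
// the final read stops exactly at the end of the file.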
#[cfg(test)]
mod test {
use std::num::NonZeroUsize;
use common_base::BitVec;
use futures::stream;
use index::inverted_index::format::reader::{InvertedIndexBlobReader, InvertedIndexReader};
use index::inverted_index::format::writer::{InvertedIndexBlobWriter, InvertedIndexWriter};
use index::inverted_index::Bytes;
use prometheus::register_int_counter_vec;
use rand::{Rng, RngCore};
use super::*;
use crate::sst::index::store::InstrumentedStore;
use crate::test_util::TestEnv;
// Repeat count for the following small fuzz tests.
const FUZZ_REPEAT_TIMES: usize = 100;
// Fuzz test for index data page key
#[test]
fn fuzz_index_calculation() {
// randomly generate a large u8 array
let mut rng = rand::thread_rng();
let mut data = vec![0u8; 1024 * 1024];
rng.fill_bytes(&mut data);
let file_id = FileId::random();
for _ in 0..FUZZ_REPEAT_TIMES {
let offset = rng.gen_range(0..data.len() as u64);
let size = rng.gen_range(0..data.len() as u32 - offset as u32);
let page_size: usize = rng.gen_range(1..1024);
let indexes =
IndexDataPageKey::generate_page_keys(file_id, offset, size, page_size as u64);
let page_num = indexes.len();
let mut read = Vec::with_capacity(size as usize);
for key in indexes.into_iter() {
let start = key.page_id as usize * page_size;
let page = if start + page_size < data.len() {
&data[start..start + page_size]
} else {
&data[start..]
};
read.extend_from_slice(page);
}
let expected_range = offset as usize..(offset + size as u64) as usize;
let read =
read[IndexDataPageKey::calculate_range(offset, size, page_size as u64)].to_vec();
if read != data.get(expected_range).unwrap() {
panic!(
"fuzz_read_index failed, offset: {}, size: {}, page_size: {}\nread len: {}, expected len: {}\nrange: {:?}, page num: {}",
offset, size, page_size, read.len(), size as usize,
IndexDataPageKey::calculate_range(offset, size, page_size as u64),
page_num
);
}
}
}
fn unpack(fst_value: u64) -> [u32; 2] {
bytemuck::cast::<u64, [u32; 2]>(fst_value)
}
async fn create_inverted_index_blob() -> Vec<u8> {
let mut blob = Vec::new();
let mut writer = InvertedIndexBlobWriter::new(&mut blob);
writer
.add_index(
"tag0".to_string(),
BitVec::from_slice(&[0b0000_0001, 0b0000_0000]),
Box::new(stream::iter(vec![
Ok((Bytes::from("a"), BitVec::from_slice(&[0b0000_0001]))),
Ok((Bytes::from("b"), BitVec::from_slice(&[0b0010_0000]))),
Ok((Bytes::from("c"), BitVec::from_slice(&[0b0000_0001]))),
])),
)
.await
.unwrap();
writer
.add_index(
"tag1".to_string(),
BitVec::from_slice(&[0b0000_0001, 0b0000_0000]),
Box::new(stream::iter(vec![
Ok((Bytes::from("x"), BitVec::from_slice(&[0b0000_0001]))),
Ok((Bytes::from("y"), BitVec::from_slice(&[0b0010_0000]))),
Ok((Bytes::from("z"), BitVec::from_slice(&[0b0000_0001]))),
])),
)
.await
.unwrap();
writer
.finish(8, NonZeroUsize::new(1).unwrap())
.await
.unwrap();
blob
}
#[tokio::test]
async fn test_inverted_index_cache() {
let blob = create_inverted_index_blob().await;
// Init a test range reader in local fs.
let mut env = TestEnv::new();
let file_size = blob.len() as u64;
let store = env.init_object_store_manager();
let temp_path = "data";
store.write(temp_path, blob).await.unwrap();
let store = InstrumentedStore::new(store);
let metric =
register_int_counter_vec!("test_bytes", "a counter for test", &["test"]).unwrap();
let counter = metric.with_label_values(&["test"]);
let range_reader = store
.range_reader("data", &counter, &counter)
.await
.unwrap();
let reader = InvertedIndexBlobReader::new(range_reader);
let mut cached_reader = CachedInvertedIndexBlobReader::new(
FileId::random(),
file_size,
reader,
Arc::new(InvertedIndexCache::new(8192, 8192, 50)),
);
let metadata = cached_reader.metadata().await.unwrap();
assert_eq!(metadata.total_row_count, 8);
assert_eq!(metadata.segment_row_count, 1);
assert_eq!(metadata.metas.len(), 2);
// tag0
let tag0 = metadata.metas.get("tag0").unwrap();
let stats0 = tag0.stats.as_ref().unwrap();
assert_eq!(stats0.distinct_count, 3);
assert_eq!(stats0.null_count, 1);
assert_eq!(stats0.min_value, Bytes::from("a"));
assert_eq!(stats0.max_value, Bytes::from("c"));
let fst0 = cached_reader
.fst(
tag0.base_offset + tag0.relative_fst_offset as u64,
tag0.fst_size,
)
.await
.unwrap();
assert_eq!(fst0.len(), 3);
let [offset, size] = unpack(fst0.get(b"a").unwrap());
let bitmap = cached_reader
.bitmap(tag0.base_offset + offset as u64, size)
.await
.unwrap();
assert_eq!(bitmap, BitVec::from_slice(&[0b0000_0001]));
let [offset, size] = unpack(fst0.get(b"b").unwrap());
let bitmap = cached_reader
.bitmap(tag0.base_offset + offset as u64, size)
.await
.unwrap();
assert_eq!(bitmap, BitVec::from_slice(&[0b0010_0000]));
let [offset, size] = unpack(fst0.get(b"c").unwrap());
let bitmap = cached_reader
.bitmap(tag0.base_offset + offset as u64, size)
.await
.unwrap();
assert_eq!(bitmap, BitVec::from_slice(&[0b0000_0001]));
// tag1
let tag1 = metadata.metas.get("tag1").unwrap();
let stats1 = tag1.stats.as_ref().unwrap();
assert_eq!(stats1.distinct_count, 3);
assert_eq!(stats1.null_count, 1);
assert_eq!(stats1.min_value, Bytes::from("x"));
assert_eq!(stats1.max_value, Bytes::from("z"));
let fst1 = cached_reader
.fst(
tag1.base_offset + tag1.relative_fst_offset as u64,
tag1.fst_size,
)
.await
.unwrap();
assert_eq!(fst1.len(), 3);
let [offset, size] = unpack(fst1.get(b"x").unwrap());
let bitmap = cached_reader
.bitmap(tag1.base_offset + offset as u64, size)
.await
.unwrap();
assert_eq!(bitmap, BitVec::from_slice(&[0b0000_0001]));
let [offset, size] = unpack(fst1.get(b"y").unwrap());
let bitmap = cached_reader
.bitmap(tag1.base_offset + offset as u64, size)
.await
.unwrap();
assert_eq!(bitmap, BitVec::from_slice(&[0b0010_0000]));
let [offset, size] = unpack(fst1.get(b"z").unwrap());
let bitmap = cached_reader
.bitmap(tag1.base_offset + offset as u64, size)
.await
.unwrap();
assert_eq!(bitmap, BitVec::from_slice(&[0b0000_0001]));
// fuzz test
let mut rng = rand::thread_rng();
for _ in 0..FUZZ_REPEAT_TIMES {
let offset = rng.gen_range(0..file_size);
let size = rng.gen_range(0..file_size as u32 - offset as u32);
let expected = cached_reader.range_read(offset, size).await.unwrap();
let read = cached_reader.get_or_load(offset, size).await.unwrap();
assert_eq!(read, expected);
}
}
}

View File

@@ -0,0 +1,322 @@
// Copyright 2023 Greptime Team
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
use std::sync::Arc;
use api::v1::index::InvertedIndexMetas;
use async_trait::async_trait;
use bytes::Bytes;
use index::inverted_index::error::Result;
use index::inverted_index::format::reader::InvertedIndexReader;
use prost::Message;
use crate::cache::index::{IndexCache, PageKey, INDEX_METADATA_TYPE};
use crate::metrics::{CACHE_HIT, CACHE_MISS};
use crate::sst::file::FileId;
const INDEX_TYPE_INVERTED_INDEX: &str = "inverted_index";
/// Cache for inverted index.
pub type InvertedIndexCache = IndexCache<FileId, InvertedIndexMetas>;
pub type InvertedIndexCacheRef = Arc<InvertedIndexCache>;
impl InvertedIndexCache {
/// Creates a new inverted index cache.
pub fn new(index_metadata_cap: u64, index_content_cap: u64, page_size: u64) -> Self {
Self::new_with_weighter(
index_metadata_cap,
index_content_cap,
page_size,
INDEX_TYPE_INVERTED_INDEX,
inverted_index_metadata_weight,
inverted_index_content_weight,
)
}
}
/// Calculates weight for inverted index metadata.
fn inverted_index_metadata_weight(k: &FileId, v: &Arc<InvertedIndexMetas>) -> u32 {
(k.as_bytes().len() + v.encoded_len()) as u32
}
/// Calculates weight for inverted index content.
fn inverted_index_content_weight((k, _): &(FileId, PageKey), v: &Bytes) -> u32 {
(k.as_bytes().len() + v.len()) as u32
}
/// Inverted index blob reader with cache.
pub struct CachedInvertedIndexBlobReader<R> {
file_id: FileId,
file_size: u64,
inner: R,
cache: InvertedIndexCacheRef,
}
impl<R> CachedInvertedIndexBlobReader<R> {
/// Creates a new inverted index blob reader with cache.
pub fn new(file_id: FileId, file_size: u64, inner: R, cache: InvertedIndexCacheRef) -> Self {
Self {
file_id,
file_size,
inner,
cache,
}
}
}
#[async_trait]
impl<R: InvertedIndexReader> InvertedIndexReader for CachedInvertedIndexBlobReader<R> {
async fn range_read(&mut self, offset: u64, size: u32) -> Result<Vec<u8>> {
let inner = &mut self.inner;
self.cache
.get_or_load(
self.file_id,
self.file_size,
offset,
size,
move |ranges| async move { inner.read_vec(&ranges).await },
)
.await
}
async fn metadata(&mut self) -> Result<Arc<InvertedIndexMetas>> {
if let Some(cached) = self.cache.get_metadata(self.file_id) {
CACHE_HIT.with_label_values(&[INDEX_METADATA_TYPE]).inc();
Ok(cached)
} else {
let meta = self.inner.metadata().await?;
self.cache.put_metadata(self.file_id, meta.clone());
CACHE_MISS.with_label_values(&[INDEX_METADATA_TYPE]).inc();
Ok(meta)
}
}
}
#[cfg(test)]
mod test {
use std::num::NonZeroUsize;
use common_base::BitVec;
use futures::stream;
use index::inverted_index::format::reader::{InvertedIndexBlobReader, InvertedIndexReader};
use index::inverted_index::format::writer::{InvertedIndexBlobWriter, InvertedIndexWriter};
use index::inverted_index::Bytes;
use prometheus::register_int_counter_vec;
use rand::{Rng, RngCore};
use super::*;
use crate::sst::index::store::InstrumentedStore;
use crate::test_util::TestEnv;
// Number of repetitions for the small fuzz tests below.
const FUZZ_REPEAT_TIMES: usize = 100;
// Fuzz test for index data page key
#[test]
fn fuzz_index_calculation() {
// randomly generate a large u8 array
let mut rng = rand::thread_rng();
let mut data = vec![0u8; 1024 * 1024];
rng.fill_bytes(&mut data);
for _ in 0..FUZZ_REPEAT_TIMES {
let offset = rng.gen_range(0..data.len() as u64);
let size = rng.gen_range(0..data.len() as u32 - offset as u32);
let page_size: usize = rng.gen_range(1..1024);
let indexes =
PageKey::generate_page_keys(offset, size, page_size as u64).collect::<Vec<_>>();
let page_num = indexes.len();
let mut read = Vec::with_capacity(size as usize);
for key in indexes.into_iter() {
let start = key.page_id as usize * page_size;
let page = if start + page_size < data.len() {
&data[start..start + page_size]
} else {
&data[start..]
};
read.extend_from_slice(page);
}
let expected_range = offset as usize..(offset + size as u64) as usize;
let read = read[PageKey::calculate_range(offset, size, page_size as u64)].to_vec();
if read != data.get(expected_range).unwrap() {
panic!(
"fuzz_read_index failed, offset: {}, size: {}, page_size: {}\nread len: {}, expected len: {}\nrange: {:?}, page num: {}",
offset, size, page_size, read.len(), size as usize,
PageKey::calculate_range(offset, size, page_size as u64),
page_num
);
}
}
}
fn unpack(fst_value: u64) -> [u32; 2] {
bytemuck::cast::<u64, [u32; 2]>(fst_value)
}
async fn create_inverted_index_blob() -> Vec<u8> {
let mut blob = Vec::new();
let mut writer = InvertedIndexBlobWriter::new(&mut blob);
writer
.add_index(
"tag0".to_string(),
BitVec::from_slice(&[0b0000_0001, 0b0000_0000]),
Box::new(stream::iter(vec![
Ok((Bytes::from("a"), BitVec::from_slice(&[0b0000_0001]))),
Ok((Bytes::from("b"), BitVec::from_slice(&[0b0010_0000]))),
Ok((Bytes::from("c"), BitVec::from_slice(&[0b0000_0001]))),
])),
)
.await
.unwrap();
writer
.add_index(
"tag1".to_string(),
BitVec::from_slice(&[0b0000_0001, 0b0000_0000]),
Box::new(stream::iter(vec![
Ok((Bytes::from("x"), BitVec::from_slice(&[0b0000_0001]))),
Ok((Bytes::from("y"), BitVec::from_slice(&[0b0010_0000]))),
Ok((Bytes::from("z"), BitVec::from_slice(&[0b0000_0001]))),
])),
)
.await
.unwrap();
writer
.finish(8, NonZeroUsize::new(1).unwrap())
.await
.unwrap();
blob
}
#[tokio::test]
async fn test_inverted_index_cache() {
let blob = create_inverted_index_blob().await;
// Init a test range reader in local fs.
let mut env = TestEnv::new();
let file_size = blob.len() as u64;
let store = env.init_object_store_manager();
let temp_path = "data";
store.write(temp_path, blob).await.unwrap();
let store = InstrumentedStore::new(store);
let metric =
register_int_counter_vec!("test_bytes", "a counter for test", &["test"]).unwrap();
let counter = metric.with_label_values(&["test"]);
let range_reader = store
.range_reader("data", &counter, &counter)
.await
.unwrap();
let reader = InvertedIndexBlobReader::new(range_reader);
let mut cached_reader = CachedInvertedIndexBlobReader::new(
FileId::random(),
file_size,
reader,
Arc::new(InvertedIndexCache::new(8192, 8192, 50)),
);
let metadata = cached_reader.metadata().await.unwrap();
assert_eq!(metadata.total_row_count, 8);
assert_eq!(metadata.segment_row_count, 1);
assert_eq!(metadata.metas.len(), 2);
// tag0
let tag0 = metadata.metas.get("tag0").unwrap();
let stats0 = tag0.stats.as_ref().unwrap();
assert_eq!(stats0.distinct_count, 3);
assert_eq!(stats0.null_count, 1);
assert_eq!(stats0.min_value, Bytes::from("a"));
assert_eq!(stats0.max_value, Bytes::from("c"));
let fst0 = cached_reader
.fst(
tag0.base_offset + tag0.relative_fst_offset as u64,
tag0.fst_size,
)
.await
.unwrap();
assert_eq!(fst0.len(), 3);
let [offset, size] = unpack(fst0.get(b"a").unwrap());
let bitmap = cached_reader
.bitmap(tag0.base_offset + offset as u64, size)
.await
.unwrap();
assert_eq!(bitmap, BitVec::from_slice(&[0b0000_0001]));
let [offset, size] = unpack(fst0.get(b"b").unwrap());
let bitmap = cached_reader
.bitmap(tag0.base_offset + offset as u64, size)
.await
.unwrap();
assert_eq!(bitmap, BitVec::from_slice(&[0b0010_0000]));
let [offset, size] = unpack(fst0.get(b"c").unwrap());
let bitmap = cached_reader
.bitmap(tag0.base_offset + offset as u64, size)
.await
.unwrap();
assert_eq!(bitmap, BitVec::from_slice(&[0b0000_0001]));
// tag1
let tag1 = metadata.metas.get("tag1").unwrap();
let stats1 = tag1.stats.as_ref().unwrap();
assert_eq!(stats1.distinct_count, 3);
assert_eq!(stats1.null_count, 1);
assert_eq!(stats1.min_value, Bytes::from("x"));
assert_eq!(stats1.max_value, Bytes::from("z"));
let fst1 = cached_reader
.fst(
tag1.base_offset + tag1.relative_fst_offset as u64,
tag1.fst_size,
)
.await
.unwrap();
assert_eq!(fst1.len(), 3);
let [offset, size] = unpack(fst1.get(b"x").unwrap());
let bitmap = cached_reader
.bitmap(tag1.base_offset + offset as u64, size)
.await
.unwrap();
assert_eq!(bitmap, BitVec::from_slice(&[0b0000_0001]));
let [offset, size] = unpack(fst1.get(b"y").unwrap());
let bitmap = cached_reader
.bitmap(tag1.base_offset + offset as u64, size)
.await
.unwrap();
assert_eq!(bitmap, BitVec::from_slice(&[0b0010_0000]));
let [offset, size] = unpack(fst1.get(b"z").unwrap());
let bitmap = cached_reader
.bitmap(tag1.base_offset + offset as u64, size)
.await
.unwrap();
assert_eq!(bitmap, BitVec::from_slice(&[0b0000_0001]));
// fuzz test
let mut rng = rand::thread_rng();
for _ in 0..FUZZ_REPEAT_TIMES {
let offset = rng.gen_range(0..file_size);
let size = rng.gen_range(0..file_size as u32 - offset as u32);
let expected = cached_reader.range_read(offset, size).await.unwrap();
let inner = &mut cached_reader.inner;
let read = cached_reader
.cache
.get_or_load(
cached_reader.file_id,
file_size,
offset,
size,
|ranges| async move { inner.read_vec(&ranges).await },
)
.await
.unwrap();
assert_eq!(read, expected);
}
}
}
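
The invariant that the `fuzz_index_calculation` test above checks is purely arithmetic: the page keys generated for a read must cover the requested byte range, and the in-page slice must map the concatenated pages back to exactly the requested bytes. The snippet below is a minimal, self-contained re-derivation of that arithmetic under the same `page_size` convention; the function names are hypothetical and only approximate what `PageKey::generate_page_keys` and `PageKey::calculate_range` do in this PR.

use std::ops::Range;

/// Ids of the pages touched by a read of `size` bytes starting at `offset`.
/// Assumes `size > 0`; zero-sized reads are not interesting here.
fn page_ids(offset: u64, size: u32, page_size: u64) -> Range<u64> {
    let first = offset / page_size;
    // Last byte of the read is `offset + size - 1`; rounding up makes the range exclusive.
    let last = (offset + size as u64).div_ceil(page_size);
    first..last
}

/// Slice of the concatenated pages that maps back to the requested bytes.
fn range_in_pages(offset: u64, size: u32, page_size: u64) -> Range<usize> {
    let start = (offset % page_size) as usize;
    start..start + size as usize
}

fn main() {
    // Reading 120 bytes at offset 130 with 50-byte pages touches pages 2..5
    // (file bytes 100..250); within those three concatenated pages the
    // requested bytes sit at 30..150.
    assert_eq!(page_ids(130, 120, 50), 2..5);
    assert_eq!(range_in_pages(130, 120, 50), 30..150);
}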

View File

@@ -271,11 +271,11 @@ impl CompactionScheduler {
current_version.options.ttl,
&schema_metadata_manager,
)
.await
.unwrap_or_else(|e| {
warn!(e; "Failed to get ttl for region: {}", region_id);
TimeToLive::default()
});
.await
.unwrap_or_else(|e| {
warn!(e; "Failed to get ttl for region: {}", region_id);
TimeToLive::default()
});
debug!(
"Pick compaction strategy {:?} for region: {}, ttl: {:?}",
@@ -351,7 +351,7 @@ impl CompactionScheduler {
job_id: None,
reason: e.reason,
}
.fail();
.fail();
}
error!(e; "Failed to schedule remote compaction job for region {}, fallback to local compaction", region_id);

View File

@@ -723,10 +723,20 @@ pub enum Error {
#[snafu(display("Failed to iter data part"))]
ReadDataPart {
#[snafu(implicit)]
location: Location,
#[snafu(source)]
error: parquet::errors::ParquetError,
},
#[snafu(display("Failed to read row group in memtable"))]
DecodeArrowRowGroup {
#[snafu(source)]
error: ArrowError,
#[snafu(implicit)]
location: Location,
},
#[snafu(display("Invalid region options, {}", reason))]
InvalidRegionOptions {
reason: String,
@@ -1029,6 +1039,7 @@ impl ErrorExt for Error {
RegionBusy { .. } => StatusCode::RegionBusy,
GetSchemaMetadata { source, .. } => source.status_code(),
Timeout { .. } => StatusCode::Cancelled,
DecodeArrowRowGroup { .. } => StatusCode::Internal,
}
}
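
For context on the `DecodeArrowRowGroup` variant added above: with `#[snafu(source)]` on the Arrow error and an implicit `Location`, snafu derives a `DecodeArrowRowGroupSnafu` context selector, so a call site only has to attach `.context(...)` to the Arrow result. Below is a minimal sketch of that pattern; the enum is a stripped-down stand-in for the crate's `Error`, and `to_batch` is an invented call site, not code from this diff.

use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;
use snafu::{Location, ResultExt, Snafu};

// Stand-in for the crate's Error enum, mirroring the new variant: the
// ArrowError is captured as `source`, the location is filled in implicitly.
#[derive(Debug, Snafu)]
enum Error {
    #[snafu(display("Failed to read row group in memtable"))]
    DecodeArrowRowGroup {
        #[snafu(source)]
        error: ArrowError,
        #[snafu(implicit)]
        location: Location,
    },
}

// Illustrative call site: wrap an ArrowError hit while decoding a memtable
// row group using the derived context selector.
fn to_batch(res: Result<RecordBatch, ArrowError>) -> Result<RecordBatch, Error> {
    res.context(DecodeArrowRowGroupSnafu)
}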

Some files were not shown because too many files have changed in this diff.