Lei, HUANG b57dfc18dc feat: pending rows batching for metrics (#7831)
* feat: metric batch 2s PoC

Signed-off-by: jeremyhi <fengjiachun@gmail.com>

* chore: max_concurrent_flushes

Signed-off-by: jeremyhi <fengjiachun@gmail.com>

* chore: work channel size

Signed-off-by: jeremyhi <fengjiachun@gmail.com>

* feat(servers): add metrics and logs for pending rows batch flush

Add the `FLUSH_ELAPSED` histogram metric to track the duration of pending
rows batch flushes in the Prometheus store protocol handler. This provides
better observability into the performance and latency of the batcher.

Also update telemetry by:
- Recording elapsed time for both successful and failed flush operations.
- Adding an informational log upon successful flush including row count and duration.
- Including elapsed time in error logs when a flush fails.

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
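
As an illustration of the recording pattern this commit describes, here is a minimal sketch using the `prometheus` crate; the metric name, help text, and the `flush_and_record` helper are assumptions, not the actual handler code.

```rust
use std::sync::LazyLock;
use std::time::Instant;

use prometheus::{register_histogram, Histogram};

// FLUSH_ELAPSED matches the metric named in the commit; the name and help
// text used at registration here are assumptions.
static FLUSH_ELAPSED: LazyLock<Histogram> = LazyLock::new(|| {
    register_histogram!(
        "pending_rows_flush_elapsed",
        "duration of pending rows batch flushes"
    )
    .unwrap()
});

// Hypothetical wrapper showing the pattern: observe elapsed time on both
// the success and the failure path, then log accordingly.
fn flush_and_record(rows: usize, flush: impl FnOnce() -> Result<(), String>) -> Result<(), String> {
    let start = Instant::now();
    let result = flush();
    let elapsed = start.elapsed();
    // Record elapsed time for successful and failed flushes alike.
    FLUSH_ELAPSED.observe(elapsed.as_secs_f64());
    match &result {
        Ok(()) => println!("flushed pending rows, rows: {rows}, elapsed: {elapsed:?}"),
        Err(e) => eprintln!("flush failed, elapsed: {elapsed:?}, error: {e}"),
    }
    result
}
```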

* feat(servers): implement columnar batching for pending rows

Refactor PendingRowsBatcher to use columnar batching for the metrics
store. Incoming RowInsertRequests are now converted to RecordBatches,
partitioned, and flushed via BulkInsert requests to datanodes.

- Enhance MultiDimPartitionRule to handle scalar boolean predicates.
- Add metrics for tracking flush failures and dropped rows.
- Update dependencies to support columnar batching in servers.

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* feat(servers): add backpressure for pending rows

Implement backpressure in PendingRowsBatcher by limiting in-flight
requests with a semaphore and making the submission wait for the flush
result. This ensures Prometheus write requests are throttled and only
return once the data has been successfully flushed to datanodes.

- Add max_inflight_requests to PromStoreOptions.
- Use oneshot channels to notify submitters of flush completion.
- Limit concurrent requests using a new inflight_semaphore.
- Update PendingRowsBatcher::submit to wait for the flush outcome.

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
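
A minimal sketch of this backpressure scheme, assuming tokio primitives; `FlushRequest` and the row type are hypothetical stand-ins, while `inflight_semaphore`, `submit`, and the oneshot notification come from the commit description.

```rust
use std::sync::Arc;

use tokio::sync::{mpsc, oneshot, Semaphore};

// Hypothetical request payload: rows to flush plus a oneshot channel the
// worker uses to report the flush outcome back to the submitter.
struct FlushRequest {
    rows: Vec<String>,
    done: oneshot::Sender<Result<(), String>>,
}

struct PendingRowsBatcher {
    // Caps in-flight submissions at max_inflight_requests permits.
    inflight_semaphore: Arc<Semaphore>,
    worker_tx: mpsc::Sender<FlushRequest>,
}

impl PendingRowsBatcher {
    // Waits for a permit, hands the rows to the worker, then blocks until
    // the worker signals that the data reached the datanodes.
    async fn submit(&self, rows: Vec<String>) -> Result<(), String> {
        let _permit = self
            .inflight_semaphore
            .acquire()
            .await
            .map_err(|e| e.to_string())?;
        let (done, outcome) = oneshot::channel();
        self.worker_tx
            .send(FlushRequest { rows, done })
            .await
            .map_err(|e| e.to_string())?;
        // Permit is held until the flush outcome arrives, throttling callers.
        outcome.await.map_err(|e| e.to_string())?
    }
}
```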

* feat: add stage-level metrics for bulk ingestion

Introduce histograms to track the elapsed time of various stages in the
metric engine bulk insert path and the server's pending rows batcher.
This provides better observability into the performance bottlenecks
of the ingestion pipeline.

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* - `src/metric-engine/src/engine/bulk_insert.rs`: Removed the fallback mechanism that converted record batches to rows when bulk inserts were unsupported, along with related helper functions and unused imports.
- `src/operator/src/insert.rs`: Removed an unused import (`common_time::TimeToLive::Instant`).

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* feat(servers): columnar Prom remote write

Optimize the Prometheus remote write path by allowing direct conversion
from decoded Prometheus samples to Arrow RecordBatches. This bypasses
intermediate row-based representations when `PendingRowsBatcher` is
active and no pipeline is used, improving ingestion efficiency.

- Implement `as_record_batch_groups` in `TablesBuilder` and `PromWriteRequest`.
- Add `submit_prom_record_batch_groups` to `PendingRowsBatcher`.
- Introduce `DecodedPromWriteRequest` in `prom_store`.
- Implement row-to-RecordBatch conversion logic in `prom_row_builder`.

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* Revert "feat(servers): columnar Prom remote write"

This reverts commit efbb63c12a3e7fcec03858ea0351efd94fec8242.

* refactor(servers): improve row to RecordBatch conversion

- Use `snafu::ensure` for row validation in `rows_to_record_batch`.
- Add explicit type hint for `MutableVector` to improve clarity.
- Reorganize and clean up imports in `pending_rows_batcher.rs`.

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* perf(servers): use arrow builders for row conversion

This commit optimizes the conversion from `api::v1::Rows` to `RecordBatch`
by using Arrow builders directly. This avoids the overhead of
`MutableVector` and `common_recordbatch`, leading to better performance
in the `pending_rows_batcher`.

Additionally, the `#[allow(dead_code)]` attribute is removed from
`modify_batch_sparse` in the metric engine as it is now utilized.

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
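
To illustrate the builder-based conversion, a sketch using the `arrow` crate directly; `rows_to_i64_column` and the `Option<i64>` cell representation are hypothetical simplifications of the actual `api::v1::Rows` handling.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Builder};

// Hypothetical helper: convert one column of row-oriented cells into an
// Arrow array. Building the array directly skips the intermediate
// MutableVector / common_recordbatch layers mentioned above.
fn rows_to_i64_column(cells: &[Option<i64>]) -> ArrayRef {
    let mut builder = Int64Builder::with_capacity(cells.len());
    for cell in cells {
        match cell {
            Some(v) => builder.append_value(*v),
            None => builder.append_null(),
        }
    }
    Arc::new(builder.finish())
}
```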

* perf(metric-engine): optimize batch modification

Optimize `modify_batch_sparse` by reusing buffers, using Arrow
builders, and employing fast-path encoding methods. This reduces
allocations and avoids redundant downcasting and serializer overhead.

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* feat/metric-engine-support-bulk:
 **Add Environment Variable for Batch Sync Control**

 - `pending_rows_batcher.rs`: Introduced an environment variable `PENDING_ROWS_BATCH_SYNC` to control the synchronization behavior of batch processing. If set to true, the submission waits for the flush result; otherwise, it returns immediately with the total row count.

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* wip

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* chore: update and fix clippy

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* fix: failing test

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* picking-pending-rows-batcher:
 Remove Unused Code and Simplify Error Handling

 - **`src/error.rs`**: Removed the `BatcherQueueFull` error variant and its associated logic, simplifying the error handling by removing unused code.
 - **`src/http/prom_store.rs`**: Eliminated the `try_decompress` function, streamlining the decompression logic by directly using `snappy_decompress` in `decode_remote_read_request`.

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* chore: parse PENDING_ROWS_BATCH_SYNC once

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
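
A sketch of the parse-once pattern, assuming `std::sync::LazyLock` and a default of `false`; both details are assumptions.

```rust
use std::sync::LazyLock;

// Read and parse PENDING_ROWS_BATCH_SYNC exactly once instead of on every
// submission; the `false` default is an assumption.
static BATCH_SYNC: LazyLock<bool> = LazyLock::new(|| {
    std::env::var("PENDING_ROWS_BATCH_SYNC")
        .map(|v| v.eq_ignore_ascii_case("true"))
        .unwrap_or(false)
});
```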

* chore: revert unrelated changes

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* **Refactor Prometheus Write Handling**

 - **`prom_store.rs`**: Introduced `pre_write` method in `PromStoreProtocolHandler` to handle pre-write checks for Prometheus remote write requests. Updated `write` method to utilize `pre_write`.
 - **`server.rs`**: Modified `PendingRowsBatcher` initialization to conditionally create a batcher based on `with_metric_engine` flag.
 - **`http/prom_store.rs`**: Integrated `pre_write` checks before submitting requests to `PendingRowsBatcher`.
 - **`query_handler.rs`**: Added `pre_write` method to `PromStoreProtocolHandler` trait for pre-write operations.

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* picking-pending-rows-batcher:
 - **Fix Label Typo**: Corrected a typo in the label value from `"flush_wn ite_region"` to `"flush_write_region"` in `pending_rows_batcher.rs`.
 - **Refactor Array Building Logic**: Introduced a macro `build_array!` to streamline the construction of `ArrayRef` for different data types, reducing code duplication in `pending_rows_batcher.rs`.

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* format toml

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* picking-pending-rows-batcher:
 ### Update PromStore and PendingRowsBatcher Configuration

 - **`prom_store.rs`**: Set `pending_rows_flush_interval` to `Duration::ZERO` to disable automatic flushing.
 - **`pending_rows_batcher.rs`**: Enhance validation to disable the batcher when `flush_interval` is zero or configuration values like `max_batch_rows`, `max_concurrent_flushes`, `worker_channel_capacity`, or `max_inflight_requests` are zero, preventing potential panics or deadlocks.

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
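
To make the zero-value rule concrete, a hypothetical config struct mirroring the options named above; the real fields live in the server's option types.

```rust
use std::time::Duration;

// Hypothetical configuration mirroring the options named in the commit.
struct BatcherConfig {
    flush_interval: Duration,
    max_batch_rows: usize,
    max_concurrent_flushes: usize,
    worker_channel_capacity: usize,
    max_inflight_requests: usize,
}

impl BatcherConfig {
    // A zero anywhere would stall the batcher (zero-capacity channels,
    // zero-permit semaphores, never-firing timers), so treat it as disabled.
    fn batcher_enabled(&self) -> bool {
        !self.flush_interval.is_zero()
            && self.max_batch_rows > 0
            && self.max_concurrent_flushes > 0
            && self.worker_channel_capacity > 0
            && self.max_inflight_requests > 0
    }
}
```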

* picking-pending-rows-batcher:
 ### Update `pending_rows_flush_interval` to Zero

 - **Files Modified**:
   - `src/frontend/src/service_config/prom_store.rs`
   - `tests-integration/tests/http.rs`

 - **Key Changes**:
   - Updated `pending_rows_flush_interval` from `Duration::from_secs(2)` to `Duration::ZERO` in `prom_store.rs`.
   - Changed `pending_rows_flush_interval` configuration from `"2s"` to `"0s"` in `http.rs`.

 These changes set the default flush interval to zero, which leaves the pending rows batcher disabled unless it is explicitly configured.

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* picking-pending-rows-batcher:
 **Add Worker Management Enhancements**

 - **`metrics.rs`**: Introduced `PENDING_WORKERS` gauge to track active pending rows batch workers.
 - **`pending_rows_batcher.rs`**:
   - Added worker idle timeout logic with `WORKER_IDLE_TIMEOUT_MULTIPLIER`.
   - Implemented worker management functions: `spawn_worker`, `remove_worker_if_same_channel`, and `should_close_worker_on_idle_timeout`.
   - Enhanced worker lifecycle management to handle idle workers and ensure proper cleanup.
 - **Tests**: Added unit tests for worker removal and idle timeout logic.

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
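
A sketch of the idle-timeout predicate, assuming it compares time since last activity against a multiple of the flush interval; the multiplier value here is made up.

```rust
use std::time::{Duration, Instant};

// Assumed value; the constant name comes from the commit, its value does not.
const WORKER_IDLE_TIMEOUT_MULTIPLIER: u32 = 3;

// A worker shuts itself down once it has been idle for several flush
// intervals, so idle workers are cleaned up instead of lingering.
fn should_close_worker_on_idle_timeout(last_activity: Instant, flush_interval: Duration) -> bool {
    last_activity.elapsed() >= flush_interval * WORKER_IDLE_TIMEOUT_MULTIPLIER
}
```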

* fix: clippy

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

---------

Signed-off-by: jeremyhi <fengjiachun@gmail.com>
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
Co-authored-by: jeremyhi <fengjiachun@gmail.com>
2026-03-27 02:19:00 +00:00

One database for metrics, logs, and traces
replacing Prometheus, Loki, and Elasticsearch

The unified OpenTelemetry backend — with SQL + PromQL on object storage.

Introduction

GreptimeDB is an open-source observability database built for Observability 2.0 — treating metrics, logs, and traces as one unified data model (wide events) instead of three separate pillars.

Use it as the single OpenTelemetry backend — replacing Prometheus, Loki, and Elasticsearch with one database built on object storage. Query with SQL and PromQL, scale without pain, cut costs up to 50x.

Features

| Feature | Description |
|---------|-------------|
| Drop-in replacement | PromQL, Prometheus remote write, Jaeger, and OpenTelemetry native. Use as your single backend for all three signals, or migrate one at a time. |
| 50x lower cost | Object storage (S3, GCS, Azure Blob, etc.) as primary storage. Compute-storage separation scales without pain. |
| SQL + PromQL | Monitor with PromQL, analyze with SQL. One database replaces Prometheus + your data warehouse. |
| Sub-second at PB-EB scale | Columnar engine with fulltext, inverted, and skipping indexes. Written in Rust. |

Perfect for:

  • Replacing Prometheus + Loki + Elasticsearch with one database
  • Scaling past Prometheus — high cardinality, long-term storage, no Thanos/Mimir overhead
  • Cutting observability costs with object storage (up to 50x savings on traces, 30% on logs)
  • AI/LLM observability — store and analyze high-volume conversation data, agent traces, and token metrics via OpenTelemetry GenAI conventions
  • Edge-to-cloud observability with unified APIs on resource-constrained devices

Why Observability 2.0? The three-pillar model (separate databases for metrics, logs, traces) creates data silos and operational complexity. GreptimeDB treats all observability data as timestamped wide events in a single columnar engine — enabling cross-signal SQL JOINs, eliminating redundant infrastructure, and naturally supporting emerging workloads like AI agent observability. Read more: Observability 2.0 and the Database for It.

Learn more in Why GreptimeDB.

How GreptimeDB Compares

| Feature | GreptimeDB | Prometheus / Thanos / Mimir | Grafana Loki | Elasticsearch |
|---------|------------|-----------------------------|--------------|---------------|
| Data types | Metrics, logs, traces | Metrics only | Logs only | Logs, traces |
| Query language | SQL + PromQL | PromQL | LogQL | Query DSL |
| Storage | Native object storage (S3, etc.) | Local disk + object storage (Thanos/Mimir) | Object storage (chunks) | Local disk |
| Scaling | Compute-storage separation, stateless nodes | Federation / Thanos / Mimir; multi-component, ops heavy | Stateless + object storage | Shard-based, ops heavy |
| Cost efficiency | Up to 50x lower storage | High at scale | Moderate | High (inverted index overhead) |
| OpenTelemetry | Native (metrics + logs + traces) | Partial (metrics only) | Partial (logs only) | Via instrumentation |

Benchmarks:

Architecture

GreptimeDB can run in two modes:

  • Standalone Mode - Single binary for development and small deployments
  • Distributed Mode - Separate components for production scale:
    • Frontend: Query processing and protocol handling
    • Datanode: Data storage and retrieval
    • Metasrv: Metadata management and coordination

Read the architecture document. DeepWiki provides an in-depth look at GreptimeDB: GreptimeDB System Overview

Try GreptimeDB

docker pull greptime/greptimedb
docker run -p 127.0.0.1:4000-4003:4000-4003 \
  -v "$(pwd)/greptimedb_data:/greptimedb_data" \
  --name greptime --rm \
  greptime/greptimedb:latest standalone start \
  --http-addr 0.0.0.0:4000 \
  --rpc-bind-addr 0.0.0.0:4001 \
  --mysql-addr 0.0.0.0:4002 \
  --postgres-addr 0.0.0.0:4003

Dashboard: http://localhost:4000/dashboard

Read more in the full Install Guide.

Troubleshooting:

  • Cannot connect to the database? Ensure that ports 4000, 4001, 4002, and 4003 are not blocked by a firewall or used by other services.
  • Failed to start? Check the container logs with docker logs greptime for further details.

Getting Started

Build From Source

Prerequisites:

  • Rust toolchain (nightly)
  • Protobuf compiler (>= 3.15)
  • C/C++ build essentials, including gcc/g++/autoconf and the glibc library (e.g., libc6-dev on Ubuntu and glibc-devel on Fedora)
  • Python toolchain (optional): Required only if using some test scripts.

Build and Run:

make
cargo run -- standalone start

Tools & Extensions

Project Status

Status: RC — marching toward v1.0 GA, planned for March 2026!

  • Deployed in production handling billions of data points daily
  • Stable APIs, actively maintained, with regular releases (version info)

GreptimeDB v1.0 represents a major milestone toward maturity — marking stable APIs, production readiness, and proven performance.

Roadmap: see the v1.0 highlights and release plan, and the 2026 roadmap.

For production use, we recommend using the latest stable release.

If you find this project useful, a star would mean a lot to us!

Star History Chart

Known Users

Community

We invite you to engage and contribute!

License

GreptimeDB is licensed under the Apache License 2.0.

Commercial Support

Running GreptimeDB in your organization? We offer enterprise add-ons, services, training, and consulting. Contact us for details.

Contributing

Acknowledgement

Special thanks to all contributors! See AUTHORS.md.
