mirror of
https://github.com/GreptimeTeam/greptimedb.git
synced 2025-12-22 22:20:02 +00:00
feat: allow one to many VRL pipeline (#7342)
Squashed commits from `feat/allow-one-to-many-pipeline`:

* Enhance pipeline processing for one-to-many transformations
  - Updated `processor.rs`, `etl.rs`, `vrl_processor.rs`, and `greptime.rs` to handle one-to-many transformations by allowing VRL processors to return arrays, expanding each element into a separate row.
  - Introduced `transform_array_elements` and `values_to_rows` functions to facilitate this transformation.
  - Added new error types in `error.rs` for cases where array elements are not objects and for transformation failures.
  - Added tests in `pipeline.rs` covering one-to-many transformations, single object processing, and error handling for non-object array elements.
  - Modified `ctx_req.rs` to clone `ContextOpt` when adding rows, ensuring correct context management during transformations.
  - Updated `pipeline.rs` in `servers` to handle transformed outputs with one-to-many row expansions, ensuring correct row padding and request formation.

* Add one-to-many VRL pipeline test in `http.rs`
  - Introduced `test_pipeline_one_to_many_vrl` to verify the VRL processor's ability to expand a single input row into multiple output rows.
  - Updated the `http_tests!` macro to include the new test.
  - Implemented test scenarios for single and multiple input rows, validating data transformation and row counts.

* Add tests for VRL pipeline transformations (`src/pipeline/src/etl.rs`)
  - One-to-many VRL expansion producing multiple output rows from a single input.
  - Backward compatibility for single object output.
  - Zero rows produced from empty arrays.
  - Validation that array elements must be objects.
  - One-to-many transformations with table suffix hints from VRL.

* Enhance pipeline transformation with per-row table suffixes
  - `src/pipeline/src/etl.rs`: Updated `TransformedOutput` to include per-row table suffixes, allowing more flexible routing of transformed data; modified `PipelineExecOutput` and related methods to handle the new structure.
  - `src/pipeline/src/etl/transform/transformer/greptime.rs`: Enhanced `values_to_rows` to support per-row table suffix extraction and application.
  - `src/pipeline/tests/common.rs`, `src/pipeline/tests/pipeline.rs`: Adjusted tests to validate per-row table suffixes, ensuring backward compatibility and correct one-to-many behavior.
  - `src/servers/src/pipeline.rs`: Modified `run_custom_pipeline` to process transformed outputs with per-row table suffixes, grouping rows by `(opt, table_name)` for insertion.

* Update VRL processor type checks (`vrl_processor.rs`)
  - Switched type checking to `contains_object()` and `contains_array()` instead of `is_object()` and `is_array()`, for compatibility with VRL type inference that may return multiple possible types.

* Improve error handling
  - Added `ArrayElementMustBeObjectSnafu` and `TransformArrayElementSnafu` error types in `etl.rs` and `greptime.rs`.
  - Moved error usage declarations in `transform_array_elements` and `values_to_rows` to the top of the file for better organization.

* Update `greptime.rs` error handling
  - `values_to_rows` now honors the `skip_error` parameter: invalid array elements are skipped when `skip_error` is true, otherwise an error is returned.
  - Added unit tests verifying `values_to_rows` under both `skip_error` settings.

* Refactor `TransformedOutput` to group rows by `ContextOpt`
  - Changed `TransformedOutput` to use a `HashMap` keyed by `ContextOpt`, allowing per-row configuration options; updated `PipelineExecOutput` methods accordingly (`src/pipeline/src/etl.rs`).
  - Introduced `transform_array_elements_to_hashmap` to handle array inputs with per-row `ContextOpt` (`src/pipeline/src/etl.rs`).
  - Modified `run_custom_pipeline` to consume the new `HashMap` structure, grouping rows by `ContextOpt` and table name (`src/servers/src/pipeline.rs`).
  - Added tests for one-to-many mapping, single object input, and empty arrays (`src/pipeline/src/etl.rs`).

* Refactor `values_to_rows` to return a `HashMap` grouped by `ContextOpt`
  - `etl.rs`: Adjusted logic for single object and array inputs; rows are extracted from the default `ContextOpt` and table suffixes applied accordingly.
  - `greptime.rs`: Handle errors gracefully with `skip_error` logic; group rows by `ContextOpt` for array inputs.
  - Tests: Updated existing tests for the new return structure and added a test verifying correct grouping of rows by per-element `ContextOpt`.

* Refactor and enhance error handling in the ETL pipeline
  - Replaced `transform_array_elements` with `transform_array_elements_by_ctx` in `etl.rs` to streamline transformation logic.
  - Updated `values_to_rows` in `greptime.rs` to use `or_default` for cleaner code.
  - Introduced the `unwrap_or_continue_if_err` macro in `etl.rs` to allow skipping errors based on pipeline context.

* Update `Row` handling in the ETL pipeline
  - Introduced the `RowWithTableSuffix` type alias to simplify handling of rows with optional table suffixes.
  - Updated function signatures in `etl.rs` and `greptime.rs` to use `RowWithTableSuffix`.
  - Adjusted tests in `greptime.rs` to verify correct grouping and processing of rows by TTL.

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
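The one-to-many expansion and per-`ContextOpt` grouping described above can be sketched in a few lines of std-only Rust. This is an illustrative simplification, not GreptimeDB's actual code: `Value`, `ContextOpt`, `Row`, and the `"greptime_table_suffix"` key are hypothetical stand-ins for the real pipeline types, and only the shape of the logic (array → one row per element, rows bucketed in a `HashMap` by context, `skip_error` deciding whether bad elements abort or are dropped) mirrors the commit.

```rust
use std::collections::HashMap;

/// Simplified pipeline value: a single object or an array of values.
#[derive(Clone, Debug, PartialEq)]
enum Value {
    Object(Vec<(String, String)>), // field name -> field value
    Array(Vec<Value>),
}

/// Simplified per-row context option (here: just a table-suffix hint).
#[derive(Clone, Debug, PartialEq, Eq, Hash, Default)]
struct ContextOpt {
    table_suffix: Option<String>,
}

type Row = Vec<(String, String)>;
type RowWithTableSuffix = (Row, Option<String>);

/// Expand a processor output into rows, grouped by `ContextOpt`.
/// A single object yields one row; an array yields one row per element.
/// Non-object elements are skipped or rejected depending on `skip_error`.
fn values_to_rows(
    value: Value,
    skip_error: bool,
) -> Result<HashMap<ContextOpt, Vec<RowWithTableSuffix>>, String> {
    let mut grouped: HashMap<ContextOpt, Vec<RowWithTableSuffix>> = HashMap::new();
    let elements = match value {
        Value::Array(items) => items,
        obj @ Value::Object(_) => vec![obj],
    };
    for element in elements {
        match element {
            Value::Object(mut fields) => {
                // Pull a per-row table-suffix hint out of the row, if present
                // ("greptime_table_suffix" is a made-up key for this sketch).
                let suffix = fields
                    .iter()
                    .position(|(k, _)| k == "greptime_table_suffix")
                    .map(|i| fields.remove(i).1);
                let opt = ContextOpt { table_suffix: suffix.clone() };
                grouped.entry(opt).or_default().push((fields, suffix));
            }
            // Tolerate non-object elements only when skip_error is set.
            _ if skip_error => continue,
            _ => return Err("array element must be an object".to_string()),
        }
    }
    Ok(grouped)
}
```

Grouping by the full `ContextOpt` (rather than keeping a flat `Vec`) is what lets a downstream writer batch rows per `(opt, table_name)` pair, as `run_custom_pipeline` does in the commit.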
@@ -122,6 +122,7 @@ macro_rules! http_tests {
        test_pipeline_context,
        test_pipeline_with_vrl,
        test_pipeline_with_hint_vrl,
        test_pipeline_one_to_many_vrl,
        test_pipeline_2,
        test_pipeline_skip_error,
        test_pipeline_filter,
@@ -3285,6 +3286,151 @@ transform:
    guard.remove_all().await;
}

/// Test one-to-many VRL pipeline expansion.
/// This test verifies that a VRL processor can return an array, which results in
/// multiple output rows from a single input row.
pub async fn test_pipeline_one_to_many_vrl(storage_type: StorageType) {
    common_telemetry::init_default_ut_logging();
    let (app, mut guard) =
        setup_test_http_app_with_frontend(storage_type, "test_pipeline_one_to_many_vrl").await;

    let client = TestClient::new(app).await;

    // Pipeline that expands events array into multiple rows
    let pipeline = r#"
processors:
  - date:
      field: timestamp
      formats:
        - "%Y-%m-%d %H:%M:%S"
      ignore_missing: true
  - vrl:
      source: |
        # Extract events array and expand each event into a separate row
        events = del(.events)
        base_host = del(.host)
        base_timestamp = del(.timestamp)

        # Map each event to a complete row object
        map_values(array!(events)) -> |event| {
            {
                "host": base_host,
                "event_type": event.type,
                "event_value": event.value,
                "timestamp": base_timestamp
            }
        }

transform:
  - field: host
    type: string
  - field: event_type
    type: string
  - field: event_value
    type: int32
  - field: timestamp
    type: time
    index: timestamp
"#;

    // 1. create pipeline
    let res = client
        .post("/v1/events/pipelines/one_to_many")
        .header("Content-Type", "application/x-yaml")
        .body(pipeline)
        .send()
        .await;
    assert_eq!(res.status(), StatusCode::OK);

    // 2. write data - single input with multiple events
    let data_body = r#"
[
    {
        "host": "server1",
        "timestamp": "2024-05-25 20:16:37",
        "events": [
            {"type": "cpu", "value": 80},
            {"type": "memory", "value": 60},
            {"type": "disk", "value": 45}
        ]
    }
]
"#;
    let res = client
        .post("/v1/events/logs?db=public&table=metrics&pipeline_name=one_to_many")
        .header("Content-Type", "application/json")
        .body(data_body)
        .send()
        .await;
    assert_eq!(res.status(), StatusCode::OK);

    // 3. verify: one input row should produce three output rows
    validate_data(
        "test_pipeline_one_to_many_vrl_count",
        &client,
        "select count(*) from metrics",
        "[[3]]",
    )
    .await;

    // 4. verify the actual data
    validate_data(
        "test_pipeline_one_to_many_vrl_data",
        &client,
        "select host, event_type, event_value from metrics order by event_type",
        "[[\"server1\",\"cpu\",80],[\"server1\",\"disk\",45],[\"server1\",\"memory\",60]]",
    )
    .await;

    // 5. Test with multiple input rows, each producing multiple output rows
    let data_body2 = r#"
[
    {
        "host": "server2",
        "timestamp": "2024-05-25 20:17:00",
        "events": [
            {"type": "cpu", "value": 90},
            {"type": "memory", "value": 70}
        ]
    },
    {
        "host": "server3",
        "timestamp": "2024-05-25 20:18:00",
        "events": [
            {"type": "cpu", "value": 50}
        ]
    }
]
"#;
    let res = client
        .post("/v1/events/logs?db=public&table=metrics&pipeline_name=one_to_many")
        .header("Content-Type", "application/json")
        .body(data_body2)
        .send()
        .await;
    assert_eq!(res.status(), StatusCode::OK);

    // 6. verify total count: 3 (from first batch) + 2 + 1 = 6 rows
    validate_data(
        "test_pipeline_one_to_many_vrl_total_count",
        &client,
        "select count(*) from metrics",
        "[[6]]",
    )
    .await;

    // 7. verify rows per host
    validate_data(
        "test_pipeline_one_to_many_vrl_per_host",
        &client,
        "select host, count(*) as cnt from metrics group by host order by host",
        "[[\"server1\",3],[\"server2\",2],[\"server3\",1]]",
    )
    .await;

    guard.remove_all().await;
}

pub async fn test_pipeline_2(storage_type: StorageType) {
    common_telemetry::init_default_ut_logging();
    let (app, mut guard) = setup_test_http_app_with_frontend(storage_type, "test_pipeline_2").await;
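The VRL source in the test above joins each element of `events` with the row's shared `host` and `timestamp` to produce one output row per event, which is why 1 input row yields 3 rows and the two batches total 6. A minimal Rust sketch of that same expansion, using simplified stand-in types (`Event` and `OutRow` are made up for illustration, not GreptimeDB types):

```rust
/// One element of the input `events` array.
#[derive(Debug, PartialEq)]
struct Event {
    r#type: String,
    value: i64,
}

/// One expanded output row, mirroring the test's transform fields.
#[derive(Debug, PartialEq)]
struct OutRow {
    host: String,
    event_type: String,
    event_value: i64,
    timestamp: String,
}

/// Expand one input record into one output row per event,
/// copying the shared host/timestamp onto each row.
fn expand(host: &str, timestamp: &str, events: Vec<Event>) -> Vec<OutRow> {
    events
        .into_iter()
        .map(|e| OutRow {
            host: host.to_string(),
            event_type: e.r#type,
            event_value: e.value,
            timestamp: timestamp.to_string(),
        })
        .collect()
}
```

With the test's inputs, `expand` returns 3 rows for `server1`, 2 for `server2`, and 1 for `server3`, matching the `[[3]]`, `[[6]]`, and per-host assertions above.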