feat: allow one-to-many VRL pipeline (#7342)

* feat/allow-one-to-many-pipeline:
 ### Enhance Pipeline Processing for One-to-Many Transformations

 - **Support One-to-Many Transformations**:
   - Updated `processor.rs`, `etl.rs`, `vrl_processor.rs`, and `greptime.rs` to handle one-to-many transformations: a VRL processor may now return an array, and each element is expanded into a separate row.
   - Introduced `transform_array_elements` and `values_to_rows` to perform this expansion (see the sketch at the end of this summary).

 - **Error Handling Enhancements**:
   - Added new error types in `error.rs` covering array elements that are not objects and failures while transforming elements.

 - **Testing Enhancements**:
   - Added tests in `pipeline.rs` to verify one-to-many transformations, single object processing, and error handling for non-object array elements.

 - **Context Management**:
   - Modified `ctx_req.rs` to clone `ContextOpt` when adding rows, ensuring correct context management during transformations.

 - **Server Pipeline Adjustments**:
   - Updated `pipeline.rs` in `servers` to handle transformed outputs with one-to-many row expansions, ensuring correct row padding and request formation.
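
 In rough terms the expansion works like the following sketch, which uses `serde_json::Value` as a stand-in for the VRL value type and hypothetical `Row`/`Error` definitions (the real signatures live in `greptime.rs`):

 ```rust
 use serde_json::Value;

 /// Hypothetical stand-ins for the pipeline's real row and error types.
 #[derive(Debug)]
 struct Row(Vec<(String, Value)>);

 #[derive(Debug)]
 enum Error {
     ArrayElementMustBeObject { index: usize },
     UnexpectedReturnType,
 }

 /// Expand a processor's return value into one row per element.
 fn values_to_rows(value: Value) -> Result<Vec<Row>, Error> {
     match value {
         // One-to-many: an array yields one row per element, and every
         // element must itself be an object.
         Value::Array(elements) => elements
             .into_iter()
             .enumerate()
             .map(|(index, element)| match element {
                 Value::Object(map) => Ok(Row(map.into_iter().collect())),
                 _ => Err(Error::ArrayElementMustBeObject { index }),
             })
             .collect(),
         // Backward compatible: a single object still yields a single row.
         Value::Object(map) => Ok(vec![Row(map.into_iter().collect())]),
         _ => Err(Error::UnexpectedReturnType),
     }
 }
 ```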

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* feat/allow-one-to-many-pipeline:
 Add one-to-many VRL pipeline test in `http.rs`

 - Introduced `test_pipeline_one_to_many_vrl` to verify VRL processor's ability to expand a single input row into multiple output rows.
 - Updated `http_tests!` macro to include the new test.
 - Implemented test scenarios for single and multiple input rows, ensuring correct data transformation and row count validation.

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* feat/allow-one-to-many-pipeline:
 ### Add Tests for VRL Pipeline Transformations

 - **File:** `src/pipeline/src/etl.rs`
   - Added tests for one-to-many VRL pipeline expansion to ensure multiple output rows from a single input.
   - Introduced tests to verify backward compatibility for single object output.
   - Implemented tests to confirm zero rows are produced from empty arrays.
   - Added validation tests to ensure array elements must be objects.
   - Developed tests for one-to-many transformations with table suffix hints from VRL.

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* feat/allow-one-to-many-pipeline:
 ### Enhance Pipeline Transformation with Per-Row Table Suffixes

 - **`src/pipeline/src/etl.rs`**: Updated `TransformedOutput` to include per-row table suffixes, allowing for more flexible routing of transformed data. Modified `PipelineExecOutput` and related methods to handle the new structure.
 - **`src/pipeline/src/etl/transform/transformer/greptime.rs`**: Enhanced `values_to_rows` to support per-row table suffix extraction and application.
 - **`src/pipeline/tests/common.rs`** and **`src/pipeline/tests/pipeline.rs`**: Adjusted tests to validate the new per-row table suffix functionality, ensuring backward compatibility and correct behavior in one-to-many transformations.
 - **`src/servers/src/pipeline.rs`**: Modified `run_custom_pipeline` to process transformed outputs with per-row table suffixes, grouping rows by `(opt, table_name)` for insertion.
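
 A minimal sketch of that grouping, with hypothetical `ContextOpt` and `Row` stand-ins (`group_rows` is illustrative, not the actual function name):

 ```rust
 use std::collections::HashMap;

 /// Hypothetical stand-ins for the real pipeline types.
 #[derive(Clone, PartialEq, Eq, Hash)]
 struct ContextOpt {
     ttl: Option<String>,
 }
 struct Row;

 /// Fan transformed rows out into per-(opt, table) insert batches. The
 /// per-row suffix from VRL, when present, extends the request's table name.
 fn group_rows(
     base_table: &str,
     rows: Vec<(Row, Option<String>, ContextOpt)>,
 ) -> HashMap<(ContextOpt, String), Vec<Row>> {
     let mut grouped: HashMap<(ContextOpt, String), Vec<Row>> = HashMap::new();
     for (row, suffix, opt) in rows {
         let table = match suffix {
             Some(s) => format!("{base_table}{s}"),
             None => base_table.to_string(),
         };
         grouped.entry((opt, table)).or_default().push(row);
     }
     grouped
 }
 ```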

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* feat/allow-one-to-many-pipeline:
 ### Update VRL Processor Type Checks

 - **File:** `vrl_processor.rs`
 - **Changes:** Updated type checking logic to use `contains_object()` and `contains_array()` methods instead of `is_object()` and `is_array()`. This change ensures compatibility with VRL type inference that may return multiple possible types.
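
 A sketch of the check, assuming a `Kind` already obtained from the compiled program's type info (`check_return_kind` is a hypothetical name, not the actual code in `vrl_processor.rs`):

 ```rust
 use vrl::value::Kind;

 /// Accept a program whose inferred return type may be a union of kinds.
 /// `is_object()` is true only for exactly-object types, while
 /// `contains_object()` also accepts unions such as "object or array",
 /// which VRL's type inference can produce.
 fn check_return_kind(kind: &Kind) -> Result<(), String> {
     if kind.contains_object() || kind.contains_array() {
         Ok(())
     } else {
         Err("VRL processor must return an object or an array".to_string())
     }
 }
 ```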

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* feat/allow-one-to-many-pipeline:
 - **Enhance Error Handling**: Added new error types `ArrayElementMustBeObjectSnafu` and `TransformArrayElementSnafu` to improve error handling in `etl.rs` and `greptime.rs` (sketched below).
 - **Refactor Error Usage**: Moved the error-related `use` declarations needed by `transform_array_elements` and `values_to_rows` to the top of their respective files, `etl.rs` and `greptime.rs`, for better organization.
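
 For context, a sketch of how such variants look with snafu; the field names and display messages are assumptions, but snafu does derive the `ArrayElementMustBeObjectSnafu` and `TransformArrayElementSnafu` context selectors from variants named this way:

 ```rust
 use snafu::Snafu;

 /// Illustrative shape of the new variants, not the exact `error.rs` contents.
 #[derive(Debug, Snafu)]
 pub enum Error {
     #[snafu(display("Array element at index {index} must be an object"))]
     ArrayElementMustBeObject { index: usize },

     #[snafu(display("Failed to transform array element at index {index}"))]
     TransformArrayElement { index: usize },
 }
 ```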

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* feat/allow-one-to-many-pipeline:
 ### Update `greptime.rs` to Enhance Error Handling

 - **Error Handling**: Modified the `values_to_rows` function to handle invalid array elements based on the `skip_error` parameter. If `skip_error` is true, invalid elements are skipped; otherwise, an error is returned.
 - **Testing**: Added unit tests in `greptime.rs` to verify the behavior of `values_to_rows` with different `skip_error` settings, ensuring correct processing of valid objects and appropriate error handling for invalid elements.
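
 A self-contained sketch of the `skip_error` branch, using `serde_json::Value` as a stand-in for the VRL value type and a hypothetical `collect_object_rows` helper:

 ```rust
 use serde_json::Value;

 /// Per-element error handling: with `skip_error` set, an invalid element
 /// is dropped; otherwise the first invalid element fails the whole batch.
 fn collect_object_rows(
     elements: Vec<Value>,
     skip_error: bool,
 ) -> Result<Vec<Value>, String> {
     let mut rows = Vec::with_capacity(elements.len());
     for (index, element) in elements.into_iter().enumerate() {
         if element.is_object() {
             rows.push(element);
         } else if skip_error {
             continue; // skip the invalid element and keep going
         } else {
             return Err(format!("array element {index} is not an object"));
         }
     }
     Ok(rows)
 }
 ```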

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* feat/allow-one-to-many-pipeline:
 ### Commit Summary

 - **Enhance `TransformedOutput` Structure**: Refactored `TransformedOutput` to use a `HashMap` for grouping rows by `ContextOpt`, allowing for per-row configuration options. Updated methods in `PipelineExecOutput` to support the new structure (`src/pipeline/src/etl.rs`).

 - **Add New Transformation Method**: Introduced `transform_array_elements_to_hashmap` to handle array inputs with per-row `ContextOpt` in `HashMap` format (`src/pipeline/src/etl.rs`).

 - **Update Pipeline Execution**: Modified `run_custom_pipeline` to process `TransformedOutput` using the new `HashMap` structure, ensuring rows are grouped by `ContextOpt` and table name (`src/servers/src/pipeline.rs`).

 - **Add Tests for New Structure**: Implemented tests to verify the functionality of the new `HashMap` structure in `TransformedOutput`, including scenarios for one-to-many mapping, single object input, and empty arrays (`src/pipeline/src/etl.rs`).
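
 Roughly, the new shape looks like this sketch (all names besides `TransformedOutput` and `ContextOpt` are stand-ins for the real definitions in the pipeline crate):

 ```rust
 use std::collections::HashMap;

 #[derive(Clone, Default, PartialEq, Eq, Hash)]
 struct ContextOpt; // per-row options such as ttl or skip_error
 struct Row;
 type TableSuffix = Option<String>;

 /// Refactored shape: rows that share a ContextOpt are grouped under one
 /// key, so per-row options survive the one-to-many fan-out intact.
 struct TransformedOutput {
     rows: HashMap<ContextOpt, Vec<(Row, TableSuffix)>>,
 }
 ```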

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* feat/allow-one-to-many-pipeline:
 ### Refactor `values_to_rows` to Return `HashMap` Grouped by `ContextOpt`

 - **`etl.rs`**:
   - Updated `values_to_rows` to return a `HashMap` grouped by `ContextOpt` instead of a vector.
   - Adjusted logic to handle single object and array inputs, ensuring rows are grouped by their `ContextOpt`.
   - Modified functions to extract rows from the default `ContextOpt` and apply table suffixes accordingly.

 - **`greptime.rs`**:
   - Enhanced `values_to_rows` to handle errors gracefully with `skip_error` logic.
   - Added logic to group rows by `ContextOpt` for array inputs.

 - **Tests**:
   - Updated existing tests to validate the new `HashMap` return structure.
   - Added a new test to verify correct grouping of rows by per-element `ContextOpt`.
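
 A minimal sketch of that grouping step, with hypothetical stand-in types (`group_by_ctx` is illustrative):

 ```rust
 use std::collections::HashMap;

 #[derive(Clone, Default, PartialEq, Eq, Hash)]
 struct ContextOpt; // extracted per array element, e.g. from hint fields
 struct Row;

 /// Group transformed rows by the ContextOpt extracted from each array
 /// element; elements that carry no options fall into the default group.
 fn group_by_ctx(rows: Vec<(Row, Option<ContextOpt>)>) -> HashMap<ContextOpt, Vec<Row>> {
     let mut grouped: HashMap<ContextOpt, Vec<Row>> = HashMap::new();
     for (row, opt) in rows {
         grouped.entry(opt.unwrap_or_default()).or_default().push(row);
     }
     grouped
 }
 ```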

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* feat/allow-one-to-many-pipeline:
 ### Refactor and Enhance Error Handling in ETL Pipeline

 - **Refactored Functionality**:
   - Replaced `transform_array_elements` with `transform_array_elements_by_ctx` in `etl.rs` to streamline transformation logic and improve error handling.
   - Updated `values_to_rows` in `greptime.rs` to use `or_default` for cleaner code.

 - **Enhanced Error Handling**:
   - Introduced `unwrap_or_continue_if_err` macro in `etl.rs` to allow skipping errors based on pipeline context, improving robustness in data processing (see the sketch below).

 These changes enhance the maintainability and error resilience of the ETL pipeline.
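
 A minimal stand-in for the pattern; the real macro also threads the pipeline context, while this version takes a plain boolean:

 ```rust
 /// Inside a loop, unwrap the value, skip the element on error when
 /// skipping is allowed, or bail out of the function otherwise.
 macro_rules! unwrap_or_continue_if_err {
     ($result:expr, $skip_error:expr) => {
         match $result {
             Ok(value) => value,
             Err(_) if $skip_error => continue,
             Err(e) => return Err(e),
         }
     };
 }

 fn process(inputs: Vec<Result<i32, String>>, skip_error: bool) -> Result<Vec<i32>, String> {
     let mut out = Vec::new();
     for input in inputs {
         // With skip_error set, bad elements are silently dropped.
         let value = unwrap_or_continue_if_err!(input, skip_error);
         out.push(value);
     }
     Ok(out)
 }
 ```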

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* feat/allow-one-to-many-pipeline:
 ### Update `Row` Handling in ETL Pipeline

 - **Refactor `Row` Type**: Introduced `RowWithTableSuffix` type alias to simplify handling of rows with optional table suffixes across the ETL pipeline.
 - **Modify Function Signatures**: Updated function signatures in `etl.rs` and `greptime.rs` to use `RowWithTableSuffix` for better clarity and consistency.
 - **Enhance Test Coverage**: Adjusted test logic in `greptime.rs` to align with the new `RowWithTableSuffix` type, ensuring correct grouping and processing of rows by TTL.

 Files affected: `etl.rs`, `greptime.rs`.
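
 A sketch of the alias, assuming a row pairs with its optional suffix (`resolve_table` is a hypothetical helper added for illustration):

 ```rust
 /// Stand-in for the real row type (a protobuf Row in GreptimeDB).
 struct Row;

 /// The alias: a transformed row plus the optional table suffix that
 /// routes it, threaded through signatures in `etl.rs` and `greptime.rs`.
 type RowWithTableSuffix = (Row, Option<String>);

 /// Hypothetical helper showing how a suffix resolves to a table name.
 fn resolve_table(base_table: &str, (row, suffix): RowWithTableSuffix) -> (String, Row) {
     let table = suffix
         .map(|s| format!("{base_table}{s}"))
         .unwrap_or_else(|| base_table.to_string());
     (table, row)
 }
 ```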

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

---------

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
Author: Lei, HUANG
Date: 2025-12-10 14:38:44 +08:00
Committed by: GitHub
Parent: 2f9130a2de
Commit: 9f1aefe98f
10 changed files with 1401 additions and 111 deletions


@@ -122,6 +122,7 @@ macro_rules! http_tests {
        test_pipeline_context,
        test_pipeline_with_vrl,
        test_pipeline_with_hint_vrl,
        test_pipeline_one_to_many_vrl,
        test_pipeline_2,
        test_pipeline_skip_error,
        test_pipeline_filter,
@@ -3285,6 +3286,151 @@ transform:
    guard.remove_all().await;
}

/// Test one-to-many VRL pipeline expansion.
/// This test verifies that a VRL processor can return an array, which results in
/// multiple output rows from a single input row.
pub async fn test_pipeline_one_to_many_vrl(storage_type: StorageType) {
    common_telemetry::init_default_ut_logging();
    let (app, mut guard) =
        setup_test_http_app_with_frontend(storage_type, "test_pipeline_one_to_many_vrl").await;
    let client = TestClient::new(app).await;

    // Pipeline that expands events array into multiple rows
    let pipeline = r#"
processors:
  - date:
      field: timestamp
      formats:
        - "%Y-%m-%d %H:%M:%S"
      ignore_missing: true
  - vrl:
      source: |
        # Extract events array and expand each event into a separate row
        events = del(.events)
        base_host = del(.host)
        base_timestamp = del(.timestamp)
        # Map each event to a complete row object
        map_values(array!(events)) -> |event| {
          {
            "host": base_host,
            "event_type": event.type,
            "event_value": event.value,
            "timestamp": base_timestamp
          }
        }

transform:
  - field: host
    type: string
  - field: event_type
    type: string
  - field: event_value
    type: int32
  - field: timestamp
    type: time
    index: timestamp
"#;

    // 1. create pipeline
    let res = client
        .post("/v1/events/pipelines/one_to_many")
        .header("Content-Type", "application/x-yaml")
        .body(pipeline)
        .send()
        .await;
    assert_eq!(res.status(), StatusCode::OK);

    // 2. write data - single input with multiple events
    let data_body = r#"
[
  {
    "host": "server1",
    "timestamp": "2024-05-25 20:16:37",
    "events": [
      {"type": "cpu", "value": 80},
      {"type": "memory", "value": 60},
      {"type": "disk", "value": 45}
    ]
  }
]
"#;
    let res = client
        .post("/v1/events/logs?db=public&table=metrics&pipeline_name=one_to_many")
        .header("Content-Type", "application/json")
        .body(data_body)
        .send()
        .await;
    assert_eq!(res.status(), StatusCode::OK);

    // 3. verify: one input row should produce three output rows
    validate_data(
        "test_pipeline_one_to_many_vrl_count",
        &client,
        "select count(*) from metrics",
        "[[3]]",
    )
    .await;

    // 4. verify the actual data
    validate_data(
        "test_pipeline_one_to_many_vrl_data",
        &client,
        "select host, event_type, event_value from metrics order by event_type",
        "[[\"server1\",\"cpu\",80],[\"server1\",\"disk\",45],[\"server1\",\"memory\",60]]",
    )
    .await;

    // 5. Test with multiple input rows, each producing multiple output rows
    let data_body2 = r#"
[
  {
    "host": "server2",
    "timestamp": "2024-05-25 20:17:00",
    "events": [
      {"type": "cpu", "value": 90},
      {"type": "memory", "value": 70}
    ]
  },
  {
    "host": "server3",
    "timestamp": "2024-05-25 20:18:00",
    "events": [
      {"type": "cpu", "value": 50}
    ]
  }
]
"#;
    let res = client
        .post("/v1/events/logs?db=public&table=metrics&pipeline_name=one_to_many")
        .header("Content-Type", "application/json")
        .body(data_body2)
        .send()
        .await;
    assert_eq!(res.status(), StatusCode::OK);

    // 6. verify total count: 3 (from first batch) + 2 + 1 = 6 rows
    validate_data(
        "test_pipeline_one_to_many_vrl_total_count",
        &client,
        "select count(*) from metrics",
        "[[6]]",
    )
    .await;

    // 7. verify rows per host
    validate_data(
        "test_pipeline_one_to_many_vrl_per_host",
        &client,
        "select host, count(*) as cnt from metrics group by host order by host",
        "[[\"server1\",3],[\"server2\",2],[\"server3\",1]]",
    )
    .await;

    guard.remove_all().await;
}

pub async fn test_pipeline_2(storage_type: StorageType) {
    common_telemetry::init_default_ut_logging();
    let (app, mut guard) = setup_test_http_app_with_frontend(storage_type, "test_pipeline_2").await;