feat: allow one-to-many VRL pipeline (#7342)

* feat/allow-one-to-many-pipeline:
 ### Enhance Pipeline Processing for One-to-Many Transformations

 - **Support One-to-Many Transformations**:
   - Updated `processor.rs`, `etl.rs`, `vrl_processor.rs`, and `greptime.rs` to handle one-to-many transformations: a VRL processor may now return an array, and each element is expanded into a separate row.
   - Introduced `transform_array_elements` and `values_to_rows` to perform this expansion (see the sketch at the end of this summary).

 - **Error Handling Enhancements**:
   - Added new error types in `error.rs` covering array elements that are not objects and failures while transforming elements.

 - **Testing Enhancements**:
   - Added tests in `pipeline.rs` to verify one-to-many transformations, single object processing, and error handling for non-object array elements.

 - **Context Management**:
   - Modified `ctx_req.rs` to clone `ContextOpt` when adding rows, ensuring correct context management during transformations.

 - **Server Pipeline Adjustments**:
   - Updated `pipeline.rs` in `servers` to handle transformed outputs with one-to-many row expansions, ensuring correct row padding and request formation.
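
 In rough terms the expansion works like the following sketch, which uses `serde_json::Value` as a stand-in for the VRL value type and hypothetical `Row`/`Error` definitions (the real signatures live in `greptime.rs`):

 ```rust
 use serde_json::Value;

 /// Hypothetical stand-ins for the pipeline's real row and error types.
 #[derive(Debug)]
 struct Row(Vec<(String, Value)>);

 #[derive(Debug)]
 enum Error {
     ArrayElementMustBeObject { index: usize },
     UnexpectedReturnType,
 }

 /// Expand a processor's return value into one row per element.
 fn values_to_rows(value: Value) -> Result<Vec<Row>, Error> {
     match value {
         // One-to-many: an array yields one row per element, and every
         // element must itself be an object.
         Value::Array(elements) => elements
             .into_iter()
             .enumerate()
             .map(|(index, element)| match element {
                 Value::Object(map) => Ok(Row(map.into_iter().collect())),
                 _ => Err(Error::ArrayElementMustBeObject { index }),
             })
             .collect(),
         // Backward compatible: a single object still yields a single row.
         Value::Object(map) => Ok(vec![Row(map.into_iter().collect())]),
         _ => Err(Error::UnexpectedReturnType),
     }
 }
 ```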

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* feat/allow-one-to-many-pipeline:
 Add one-to-many VRL pipeline test in `http.rs`

 - Introduced `test_pipeline_one_to_many_vrl` to verify VRL processor's ability to expand a single input row into multiple output rows.
 - Updated `http_tests!` macro to include the new test.
 - Implemented test scenarios for single and multiple input rows, ensuring correct data transformation and row count validation.

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* feat/allow-one-to-many-pipeline:
 ### Add Tests for VRL Pipeline Transformations

 - **File:** `src/pipeline/src/etl.rs`
   - Added tests for one-to-many VRL pipeline expansion to ensure multiple output rows from a single input.
   - Introduced tests to verify backward compatibility for single object output.
   - Implemented tests to confirm zero rows are produced from empty arrays.
   - Added validation tests to ensure array elements must be objects.
   - Developed tests for one-to-many transformations with table suffix hints from VRL.

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* feat/allow-one-to-many-pipeline:
 ### Enhance Pipeline Transformation with Per-Row Table Suffixes

 - **`src/pipeline/src/etl.rs`**: Updated `TransformedOutput` to include per-row table suffixes, allowing for more flexible routing of transformed data. Modified `PipelineExecOutput` and related methods to handle the new structure.
 - **`src/pipeline/src/etl/transform/transformer/greptime.rs`**: Enhanced `values_to_rows` to support per-row table suffix extraction and application.
 - **`src/pipeline/tests/common.rs`** and **`src/pipeline/tests/pipeline.rs`**: Adjusted tests to validate the new per-row table suffix functionality, ensuring backward compatibility and correct behavior in one-to-many transformations.
 - **`src/servers/src/pipeline.rs`**: Modified `run_custom_pipeline` to process transformed outputs with per-row table suffixes, grouping rows by `(opt, table_name)` for insertion.
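
 A minimal sketch of that grouping, with hypothetical `ContextOpt` and `Row` stand-ins (`group_rows` is illustrative, not the actual function name):

 ```rust
 use std::collections::HashMap;

 /// Hypothetical stand-ins for the real pipeline types.
 #[derive(Clone, PartialEq, Eq, Hash)]
 struct ContextOpt {
     ttl: Option<String>,
 }
 struct Row;

 /// Fan transformed rows out into per-(opt, table) insert batches. The
 /// per-row suffix from VRL, when present, extends the request's table name.
 fn group_rows(
     base_table: &str,
     rows: Vec<(Row, Option<String>, ContextOpt)>,
 ) -> HashMap<(ContextOpt, String), Vec<Row>> {
     let mut grouped: HashMap<(ContextOpt, String), Vec<Row>> = HashMap::new();
     for (row, suffix, opt) in rows {
         let table = match suffix {
             Some(s) => format!("{base_table}{s}"),
             None => base_table.to_string(),
         };
         grouped.entry((opt, table)).or_default().push(row);
     }
     grouped
 }
 ```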

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* feat/allow-one-to-many-pipeline:
 ### Update VRL Processor Type Checks

 - **File:** `vrl_processor.rs`
 - **Changes:** Updated type checking logic to use `contains_object()` and `contains_array()` methods instead of `is_object()` and `is_array()`. This change ensures compatibility with VRL type inference that may return multiple possible types.
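
 A sketch of the check, assuming a `Kind` already obtained from the compiled program's type info (`check_return_kind` is a hypothetical name, not the actual code in `vrl_processor.rs`):

 ```rust
 use vrl::value::Kind;

 /// Accept a program whose inferred return type may be a union of kinds.
 /// `is_object()` is true only for exactly-object types, while
 /// `contains_object()` also accepts unions such as "object or array",
 /// which VRL's type inference can produce.
 fn check_return_kind(kind: &Kind) -> Result<(), String> {
     if kind.contains_object() || kind.contains_array() {
         Ok(())
     } else {
         Err("VRL processor must return an object or an array".to_string())
     }
 }
 ```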

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* feat/allow-one-to-many-pipeline:
 - **Enhance Error Handling**: Added new error types `ArrayElementMustBeObjectSnafu` and `TransformArrayElementSnafu` to improve error handling in `etl.rs` and `greptime.rs` (sketched below).
 - **Refactor Error Usage**: Moved the error-related `use` declarations needed by `transform_array_elements` and `values_to_rows` to the top of their respective files, `etl.rs` and `greptime.rs`, for better organization.
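
 For context, a sketch of how such variants look with snafu; the field names and display messages are assumptions, but snafu does derive the `ArrayElementMustBeObjectSnafu` and `TransformArrayElementSnafu` context selectors from variants named this way:

 ```rust
 use snafu::Snafu;

 /// Illustrative shape of the new variants, not the exact `error.rs` contents.
 #[derive(Debug, Snafu)]
 pub enum Error {
     #[snafu(display("Array element at index {index} must be an object"))]
     ArrayElementMustBeObject { index: usize },

     #[snafu(display("Failed to transform array element at index {index}"))]
     TransformArrayElement { index: usize },
 }
 ```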

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* feat/allow-one-to-many-pipeline:
 ### Update `greptime.rs` to Enhance Error Handling

 - **Error Handling**: Modified the `values_to_rows` function to handle invalid array elements based on the `skip_error` parameter. If `skip_error` is true, invalid elements are skipped; otherwise, an error is returned.
 - **Testing**: Added unit tests in `greptime.rs` to verify the behavior of `values_to_rows` with different `skip_error` settings, ensuring correct processing of valid objects and appropriate error handling for invalid elements.
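
 A self-contained sketch of the `skip_error` branch, using `serde_json::Value` as a stand-in for the VRL value type and a hypothetical `collect_object_rows` helper:

 ```rust
 use serde_json::Value;

 /// Per-element error handling: with `skip_error` set, an invalid element
 /// is dropped; otherwise the first invalid element fails the whole batch.
 fn collect_object_rows(
     elements: Vec<Value>,
     skip_error: bool,
 ) -> Result<Vec<Value>, String> {
     let mut rows = Vec::with_capacity(elements.len());
     for (index, element) in elements.into_iter().enumerate() {
         if element.is_object() {
             rows.push(element);
         } else if skip_error {
             continue; // skip the invalid element and keep going
         } else {
             return Err(format!("array element {index} is not an object"));
         }
     }
     Ok(rows)
 }
 ```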

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* feat/allow-one-to-many-pipeline:
 ### Commit Summary

 - **Enhance `TransformedOutput` Structure**: Refactored `TransformedOutput` to use a `HashMap` for grouping rows by `ContextOpt`, allowing for per-row configuration options. Updated methods in `PipelineExecOutput` to support the new structure (`src/pipeline/src/etl.rs`).

 - **Add New Transformation Method**: Introduced `transform_array_elements_to_hashmap` to handle array inputs with per-row `ContextOpt` in `HashMap` format (`src/pipeline/src/etl.rs`).

 - **Update Pipeline Execution**: Modified `run_custom_pipeline` to process `TransformedOutput` using the new `HashMap` structure, ensuring rows are grouped by `ContextOpt` and table name (`src/servers/src/pipeline.rs`).

 - **Add Tests for New Structure**: Implemented tests to verify the functionality of the new `HashMap` structure in `TransformedOutput`, including scenarios for one-to-many mapping, single object input, and empty arrays (`src/pipeline/src/etl.rs`).
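
 Roughly, the new shape looks like this sketch (all names besides `TransformedOutput` and `ContextOpt` are stand-ins for the real definitions in the pipeline crate):

 ```rust
 use std::collections::HashMap;

 #[derive(Clone, Default, PartialEq, Eq, Hash)]
 struct ContextOpt; // per-row options such as ttl or skip_error
 struct Row;
 type TableSuffix = Option<String>;

 /// Refactored shape: rows that share a ContextOpt are grouped under one
 /// key, so per-row options survive the one-to-many fan-out intact.
 struct TransformedOutput {
     rows: HashMap<ContextOpt, Vec<(Row, TableSuffix)>>,
 }
 ```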

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* feat/allow-one-to-many-pipeline:
 ### Refactor `values_to_rows` to Return `HashMap` Grouped by `ContextOpt`

 - **`etl.rs`**:
   - Updated `values_to_rows` to return a `HashMap` grouped by `ContextOpt` instead of a vector.
   - Adjusted logic to handle single object and array inputs, ensuring rows are grouped by their `ContextOpt`.
   - Modified functions to extract rows from the default `ContextOpt` and apply table suffixes accordingly.

 - **`greptime.rs`**:
   - Enhanced `values_to_rows` to handle errors gracefully with `skip_error` logic.
   - Added logic to group rows by `ContextOpt` for array inputs.

 - **Tests**:
   - Updated existing tests to validate the new `HashMap` return structure.
   - Added a new test to verify correct grouping of rows by per-element `ContextOpt`.
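
 A minimal sketch of that grouping step, with hypothetical stand-in types (`group_by_ctx` is illustrative):

 ```rust
 use std::collections::HashMap;

 #[derive(Clone, Default, PartialEq, Eq, Hash)]
 struct ContextOpt; // extracted per array element, e.g. from hint fields
 struct Row;

 /// Group transformed rows by the ContextOpt extracted from each array
 /// element; elements that carry no options fall into the default group.
 fn group_by_ctx(rows: Vec<(Row, Option<ContextOpt>)>) -> HashMap<ContextOpt, Vec<Row>> {
     let mut grouped: HashMap<ContextOpt, Vec<Row>> = HashMap::new();
     for (row, opt) in rows {
         grouped.entry(opt.unwrap_or_default()).or_default().push(row);
     }
     grouped
 }
 ```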

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* feat/allow-one-to-many-pipeline:
 ### Refactor and Enhance Error Handling in ETL Pipeline

 - **Refactored Functionality**:
   - Replaced `transform_array_elements` with `transform_array_elements_by_ctx` in `etl.rs` to streamline transformation logic and improve error handling.
   - Updated `values_to_rows` in `greptime.rs` to use `or_default` for cleaner code.

 - **Enhanced Error Handling**:
   - Introduced `unwrap_or_continue_if_err` macro in `etl.rs` to allow skipping errors based on pipeline context, improving robustness in data processing (see the sketch below).

 These changes enhance the maintainability and error resilience of the ETL pipeline.
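
 A minimal stand-in for the pattern; the real macro also threads the pipeline context, while this version takes a plain boolean:

 ```rust
 /// Inside a loop, unwrap the value, skip the element on error when
 /// skipping is allowed, or bail out of the function otherwise.
 macro_rules! unwrap_or_continue_if_err {
     ($result:expr, $skip_error:expr) => {
         match $result {
             Ok(value) => value,
             Err(_) if $skip_error => continue,
             Err(e) => return Err(e),
         }
     };
 }

 fn process(inputs: Vec<Result<i32, String>>, skip_error: bool) -> Result<Vec<i32>, String> {
     let mut out = Vec::new();
     for input in inputs {
         // With skip_error set, bad elements are silently dropped.
         let value = unwrap_or_continue_if_err!(input, skip_error);
         out.push(value);
     }
     Ok(out)
 }
 ```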

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

* feat/allow-one-to-many-pipeline:
 ### Update `Row` Handling in ETL Pipeline

 - **Refactor `Row` Type**: Introduced `RowWithTableSuffix` type alias to simplify handling of rows with optional table suffixes across the ETL pipeline.
 - **Modify Function Signatures**: Updated function signatures in `etl.rs` and `greptime.rs` to use `RowWithTableSuffix` for better clarity and consistency.
 - **Enhance Test Coverage**: Adjusted test logic in `greptime.rs` to align with the new `RowWithTableSuffix` type, ensuring correct grouping and processing of rows by TTL.

 Files affected: `etl.rs`, `greptime.rs`.
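
 A sketch of the alias, assuming a row pairs with its optional suffix (`resolve_table` is a hypothetical helper added for illustration):

 ```rust
 /// Stand-in for the real row type (a protobuf Row in GreptimeDB).
 struct Row;

 /// The alias: a transformed row plus the optional table suffix that
 /// routes it, threaded through signatures in `etl.rs` and `greptime.rs`.
 type RowWithTableSuffix = (Row, Option<String>);

 /// Hypothetical helper showing how a suffix resolves to a table name.
 fn resolve_table(base_table: &str, (row, suffix): RowWithTableSuffix) -> (String, Row) {
     let table = suffix
         .map(|s| format!("{base_table}{s}"))
         .unwrap_or_else(|| base_table.to_string());
     (table, row)
 }
 ```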

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>

---------

Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
Author: Lei, HUANG
Date: 2025-12-10 14:38:44 +08:00
Committed by: GitHub
Parent: 2f9130a2de
Commit: 9f1aefe98f
10 changed files with 1401 additions and 111 deletions


@@ -122,6 +122,7 @@ macro_rules! http_tests {
        test_pipeline_context,
        test_pipeline_with_vrl,
        test_pipeline_with_hint_vrl,
        test_pipeline_one_to_many_vrl,
        test_pipeline_2,
        test_pipeline_skip_error,
        test_pipeline_filter,
@@ -3285,6 +3286,151 @@ transform:
    guard.remove_all().await;
}

/// Test one-to-many VRL pipeline expansion.
/// This test verifies that a VRL processor can return an array, which results in
/// multiple output rows from a single input row.
pub async fn test_pipeline_one_to_many_vrl(storage_type: StorageType) {
    common_telemetry::init_default_ut_logging();
    let (app, mut guard) =
        setup_test_http_app_with_frontend(storage_type, "test_pipeline_one_to_many_vrl").await;
    let client = TestClient::new(app).await;

    // Pipeline that expands events array into multiple rows
    let pipeline = r#"
processors:
  - date:
      field: timestamp
      formats:
        - "%Y-%m-%d %H:%M:%S"
      ignore_missing: true
  - vrl:
      source: |
        # Extract events array and expand each event into a separate row
        events = del(.events)
        base_host = del(.host)
        base_timestamp = del(.timestamp)
        # Map each event to a complete row object
        map_values(array!(events)) -> |event| {
          {
            "host": base_host,
            "event_type": event.type,
            "event_value": event.value,
            "timestamp": base_timestamp
          }
        }

transform:
  - field: host
    type: string
  - field: event_type
    type: string
  - field: event_value
    type: int32
  - field: timestamp
    type: time
    index: timestamp
"#;

    // 1. create pipeline
    let res = client
        .post("/v1/events/pipelines/one_to_many")
        .header("Content-Type", "application/x-yaml")
        .body(pipeline)
        .send()
        .await;
    assert_eq!(res.status(), StatusCode::OK);

    // 2. write data - single input with multiple events
    let data_body = r#"
[
  {
    "host": "server1",
    "timestamp": "2024-05-25 20:16:37",
    "events": [
      {"type": "cpu", "value": 80},
      {"type": "memory", "value": 60},
      {"type": "disk", "value": 45}
    ]
  }
]
"#;
    let res = client
        .post("/v1/events/logs?db=public&table=metrics&pipeline_name=one_to_many")
        .header("Content-Type", "application/json")
        .body(data_body)
        .send()
        .await;
    assert_eq!(res.status(), StatusCode::OK);

    // 3. verify: one input row should produce three output rows
    validate_data(
        "test_pipeline_one_to_many_vrl_count",
        &client,
        "select count(*) from metrics",
        "[[3]]",
    )
    .await;

    // 4. verify the actual data
    validate_data(
        "test_pipeline_one_to_many_vrl_data",
        &client,
        "select host, event_type, event_value from metrics order by event_type",
        "[[\"server1\",\"cpu\",80],[\"server1\",\"disk\",45],[\"server1\",\"memory\",60]]",
    )
    .await;

    // 5. Test with multiple input rows, each producing multiple output rows
    let data_body2 = r#"
[
  {
    "host": "server2",
    "timestamp": "2024-05-25 20:17:00",
    "events": [
      {"type": "cpu", "value": 90},
      {"type": "memory", "value": 70}
    ]
  },
  {
    "host": "server3",
    "timestamp": "2024-05-25 20:18:00",
    "events": [
      {"type": "cpu", "value": 50}
    ]
  }
]
"#;
    let res = client
        .post("/v1/events/logs?db=public&table=metrics&pipeline_name=one_to_many")
        .header("Content-Type", "application/json")
        .body(data_body2)
        .send()
        .await;
    assert_eq!(res.status(), StatusCode::OK);

    // 6. verify total count: 3 (from first batch) + 2 + 1 = 6 rows
    validate_data(
        "test_pipeline_one_to_many_vrl_total_count",
        &client,
        "select count(*) from metrics",
        "[[6]]",
    )
    .await;

    // 7. verify rows per host
    validate_data(
        "test_pipeline_one_to_many_vrl_per_host",
        &client,
        "select host, count(*) as cnt from metrics group by host order by host",
        "[[\"server1\",3],[\"server2\",2],[\"server3\",1]]",
    )
    .await;

    guard.remove_all().await;
}

pub async fn test_pipeline_2(storage_type: StorageType) {
    common_telemetry::init_default_ut_logging();
    let (app, mut guard) = setup_test_http_app_with_frontend(storage_type, "test_pipeline_2").await;