* feat: use correct projection index for old format
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: remove allow dead_code from format
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: check and convert old format to flat format
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: subtract primary key num from projection
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: always convert the batch in FlatRowGroupReader
Signed-off-by: evenyag <realevenyag@gmail.com>
* style: fix clippy
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: Change &Option<&[]> to Option<&[]>
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: only build arrow schema once
Adds a method flat_sst_arrow_schema_column_num() to get the number of fields.
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: Handle flat format and old format separately
Adds two structs ParquetFlat and ParquetPrimaryKeyToFlat.
ParquetPrimaryKeyToFlat delegates stats and projection to the
PrimaryKeyReadFormat.
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: handle non string tag correctly
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: do not register file cache twice
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: clean temp files
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: add rows and bytes to flush success log
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: convert format in memtable
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: add compaction flag to ScanInput
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: compaction should use old format for sparse encoding
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: use old format for merge schema in sparse encoding
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: read legacy format without converting if skip_auto_convert is set
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: support sparse encoding in bulk parts
Signed-off-by: evenyag <realevenyag@gmail.com>
---------
Signed-off-by: evenyag <realevenyag@gmail.com>
* bulk-multiparts-merge-reader:
**Enhance Memtable Iteration and Flushing Logic**
- **`flush.rs`**: Updated `RegionFlushTask` to handle multiple ranges using `MergeReaderBuilder` for improved source management during flush operations.
- **`memtable.rs`**: Introduced `build_prune_iter` and `build_iter` methods in `MemtableRange` for flexible iteration. Added `MemtableRanges` struct to manage multiple contexts.
- **`simple_bulk_memtable.rs`**: Refactored to use `BatchIterBuilder` and `BatchIterBuilderDeprecated` for iteration, supporting new `read_to_values` method in `Series`.
- **`time_series.rs`**: Added `read_to_values` and `finish_cloned` methods in `Series` and `ValueBuilder` for efficient data handling.
- **`scan_util.rs`**: Replaced `build_iter` with `build_prune_iter` for range iteration, enhancing scan utility.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* bulk-multiparts-merge-reader:
- **Add Rayon for Parallel Processing**: Introduced `rayon` for parallel processing in `simple_bulk_memtable.rs` and updated `Cargo.toml` and `Cargo.lock` to include `rayon` dependency.
- **Enhance Benchmarking**: Added new benchmarks in `simple_bulk_memtable.rs` to compare parallel vs sequential processing, projection, sequence filtering, and write performance.
- **Make Structs and Methods Public**: Changed visibility of several structs and methods to `pub` in `simple_bulk_memtable.rs`, `memtable.rs`, `time_series.rs`, and `test_util.rs` to facilitate testing and benchmarking.
- **Update Criterion Features**: Modified `Cargo.toml` to include `html_reports` feature for `criterion`.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* bulk-multiparts-merge-reader:
### Commit Summary
- **Refactor `SimpleBulkMemtable`**:
- Moved `ranges_sequential` function to a new `test_only` module and made it a method of `SimpleBulkMemtable`.
- Made several fields in `SimpleBulkMemtable` private and added a `region_metadata` getter.
- Affected files: `simple_bulk_memtable.rs`, `test_only.rs`.
- **Benchmark Adjustments**:
- Updated benchmark functions to use the new `ranges_sequential` method.
- Affected file: `simple_bulk_memtable.rs`.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* bulk-multiparts-merge-reader:
### Add Test Configuration for `iter` Method in Memtable Implementations
- **Enhancements**:
- Added `#[cfg(any(test, feature = "test"))]` attribute to the `iter` method in various `Memtable` implementations to enable conditional compilation for testing purposes.
- Affected files:
- `src/mito2/src/memtable.rs`
- `src/mito2/src/memtable/bulk.rs`
- `src/mito2/src/memtable/partition_tree.rs`
- `src/mito2/src/memtable/simple_bulk_memtable.rs`
- `src/mito2/src/memtable/time_series.rs`
- `src/mito2/src/test_util/memtable_util.rs`
- **Benchmark Adjustments**:
- Removed `black_box` usage in `bench_memtable_write_performance` function to streamline benchmarking.
- Affected file: `src/mito2/benches/simple_bulk_memtable.rs`
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* bulk-multiparts-merge-reader:
**Enhance Async Support and Refactor Iteration in `mito2`**
- **Add Async Features**: Updated `Cargo.toml` to include `async` and `async_tokio` features for `criterion`.
- **Async Iteration**: Introduced async functions `flush` and `flush_original` in `simple_bulk_memtable.rs` to handle memtable flushing using async iterators.
- **Refactor Iteration Logic**: Moved `create_iter` and `BatchIterBuilderDeprecated` to `test_only.rs` for better separation of concerns.
- **Public API Change**: Made `next_batch` in `read.rs` public to support async batch processing.
- **Benchmark Updates**: Modified benchmarks in `simple_bulk_memtable.rs` to use async runtime for performance testing.
Files affected: `Cargo.toml`, `simple_bulk_memtable.rs`, `test_only.rs`, `read.rs`.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* bulk-multiparts-merge-reader:
**Enhance Benchmarking for Memtable**
- Refactored `create_large_memtable` to `create_memtable_with_rows` in `simple_bulk_memtable.rs` to allow dynamic row count configuration.
- Introduced parameterized benchmarking in `bench_ranges_parallel_vs_sequential` to test various row counts, improving the flexibility and coverage of performance tests.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* bulk-multiparts-merge-reader:
### Enhance Memory Management and Public API
- **`builder.rs`**: Made `next_offset` method public to allow external access to offset calculations.
- **`simple_bulk_memtable.rs`**: Simplified the `series.extend` method by removing the iterator conversion for `fields`.
- **`time_series.rs`**:
- Added `can_accommodate` method to `ValueBuilder` to check if fields can be accommodated without offset overflow.
- Modified `extend` method to use a `Vec` for `fields` instead of an iterator, improving memory management and error handling.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
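The `can_accommodate` check above can be pictured with a minimal sketch. It assumes values are stored in one byte buffer addressed by `i32` offsets (as in Arrow's `Utf8` layout); `StringColumnBuilder` and `push` are hypothetical names, not the real `ValueBuilder` API.

```rust
/// Hypothetical builder for variable-length values addressed by `i32` offsets.
pub struct StringColumnBuilder {
    offsets: Vec<i32>,
    data: Vec<u8>,
}

impl StringColumnBuilder {
    pub fn new() -> Self {
        // The offsets vector always starts with a single 0.
        Self { offsets: vec![0], data: Vec::new() }
    }

    /// Offset at which the next value would start.
    pub fn next_offset(&self) -> i32 {
        *self.offsets.last().expect("offsets is never empty")
    }

    /// Returns true if `extra_bytes` more bytes still fit in an `i32` offset.
    pub fn can_accommodate(&self, extra_bytes: usize) -> bool {
        i32::try_from(extra_bytes)
            .ok()
            .and_then(|n| self.next_offset().checked_add(n))
            .is_some()
    }

    /// Appends `value` and returns true, or returns false if it would overflow.
    pub fn push(&mut self, value: &str) -> bool {
        if !self.can_accommodate(value.len()) {
            return false;
        }
        self.data.extend_from_slice(value.as_bytes());
        // Safe: `can_accommodate` proved the sum fits in `i32`.
        let next = self.next_offset() + value.len() as i32;
        self.offsets.push(next);
        true
    }
}
```

Checking before appending lets the caller start a new builder instead of panicking on overflow.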
* bulk-multiparts-merge-reader:
Add License and Enhance Testing in `simple_bulk_memtable.rs`
- Added Apache License header to `simple_bulk_memtable.rs`.
- Modified test configuration in `simple_bulk_memtable.rs` to include `any(test, feature = "test")`.
- Introduced a new test `test_write_read_large_string` in `simple_bulk_memtable.rs` to verify handling of large strings.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* bulk-multiparts-merge-reader:
Update `Cargo.toml` dependencies
- Adjust features for `common-meta` and `mito-codec` to include "testing".
- Maintain `criterion` version and features for async support.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* bulk-multiparts-merge-reader:
### Update Predicate Type in Memtable Iterators
- **Files Modified**:
- `src/mito2/src/memtable.rs`
- `src/mito2/src/memtable/bulk.rs`
- `src/mito2/src/memtable/simple_bulk_memtable.rs`
- **Key Changes**:
- Updated the `iter` method in `Memtable` trait and its implementations to use `Option<table::predicate::Predicate>` instead of `Option<Predicate>`.
- Adjusted return type in `BulkMemtable`'s `iter` method to `Result<crate::memtable::BoxedBatchIterator>`.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* bulk-multiparts-merge-reader:
**Enhance Memtable Functionality**
- **`memtable.rs`**:
- Added `Clone` trait to `MemtableStats` and made `num_ranges` public.
- Introduced `num_rows` field in `MemtableRange` and updated its constructor.
- Added `num_rows` method to `MemtableRange`.
- **`partition_tree.rs`, `simple_bulk_memtable.rs`, `time_series.rs`**:
- Updated `MemtableRange` instantiation to include `num_rows`.
- **`range.rs`**:
- Refactored `MemRangeBuilder` to handle a single `MemtableRange` and `MemtableStats`.
- **`scan_region.rs`**:
- Enhanced memtable filtering based on time range and updated `MemRangeBuilder` usage.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* bulk-multiparts-merge-reader:
**Enhancements and Bug Fixes**
- **Deduplication Enhancements**:
- Introduced `DedupReader` and `LastRow` as public structs in `dedup.rs` to enhance deduplication capabilities.
- Added `LastNonNull` deduplication strategy in `flush.rs` and `simple_bulk_memtable.rs`.
- **Memtable Improvements**:
- Updated `SimpleBulkMemtable` to support batch size configuration and deduplication strategies.
- Modified `Series` struct in `time_series.rs` to include a configurable capacity.
- **Testing Enhancements**:
- Added new test `test_write_dedup` in `simple_bulk_memtable.rs` to verify deduplication functionality.
- Updated existing tests to include `OpType` parameter for better operation type handling.
- **Refactoring**:
- Renamed `BatchIterBuilder` to `BatchRangeBuilder` in `simple_bulk_memtable.rs` for clarity.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
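As a rough illustration of the two deduplication strategies named above, the sketch below assumes the usual meaning of the modes: `LastRow` keeps the newest row whole, `LastNonNull` keeps the newest non-null value per field. The free functions and the `Fields` alias are hypothetical, and delete handling is ignored.

```rust
/// Field values of one row; `None` stands for NULL.
pub type Fields = Vec<Option<i64>>;

/// `versions` holds all rows sharing one (primary key, timestamp),
/// ordered from newest to oldest. Keeps the newest row as-is.
pub fn last_row(versions: &[Fields]) -> Option<Fields> {
    versions.first().cloned()
}

/// For each field, keeps the newest non-null value across all versions.
pub fn last_non_null(versions: &[Fields]) -> Option<Fields> {
    let mut merged = versions.first()?.clone();
    for older in &versions[1..] {
        for (slot, value) in merged.iter_mut().zip(older) {
            // Only fill fields that are still NULL in the newer versions.
            if slot.is_none() {
                *slot = *value;
            }
        }
    }
    Some(merged)
}
```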
* bulk-multiparts-merge-reader:
- **Refactor `flush.rs`:** Removed `LastNonNullIter` usage and adjusted `DedupReader` instantiation to use `LastRow::new(false)` and `LastNonNull::new(false)`.
- **Enhance `simple_bulk_memtable.rs`:** Added logic to handle `LastNonNull` merge mode in `IterBuilder`. Introduced new tests: `test_delete_only` and `test_single_range` to verify delete operations and single range handling.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* fix: tests
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat/bulk-wal:
### Refactor: Simplify Data Handling in LogStore Implementations
- **`kafka/log_store.rs`, `raft_engine/log_store.rs`, `wal.rs`, `raw_entry_reader.rs`, `logstore.rs`:**
- Refactored `entry` and `build_entry` functions to accept `Vec<u8>` directly instead of `&mut Vec<u8>`.
- Removed usage of `std::mem::take` for data handling, simplifying the code and improving readability.
- Updated test cases to align with the new function signatures.
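A minimal sketch of the signature change, using a hypothetical `Entry` struct rather than the real log store types:

```rust
/// Hypothetical stand-in for a log store entry; not the real type.
#[derive(Debug, PartialEq)]
pub struct Entry {
    pub region_id: u64,
    pub entry_id: u64,
    pub data: Vec<u8>,
}

/// Old shape: the callee drains the caller's buffer with `std::mem::take`.
pub fn build_entry_take(data: &mut Vec<u8>, entry_id: u64, region_id: u64) -> Entry {
    Entry { region_id, entry_id, data: std::mem::take(data) }
}

/// New shape: ownership moves into the function, so no `take` is needed.
pub fn build_entry(data: Vec<u8>, entry_id: u64, region_id: u64) -> Entry {
    Entry { region_id, entry_id, data }
}
```

Both build the same entry; the by-value form makes the ownership transfer visible at the call site instead of silently leaving an empty buffer behind.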
* feat/bulk-wal:
### Add Support for Bulk WAL Entries and Flight Data Encoding
- **Add `raw_data` field to `BulkPart` and related structs**: Updated `BulkPart` and related structures in `src/mito2/src/memtable/bulk/part.rs`, `src/mito2/src/memtable/simple_bulk_memtable.rs`, `src/mito2/src/memtable/time_partition.rs`, `src/mito2/src/region_write_ctx.rs`, `src/mito2/src/worker/handle_bulk_insert.rs`, and `src/store-api/src/region_request.rs` to include a new `raw_data` field for handling Arrow IPC data.
- **Implement Flight Data Encoding**: Added a new module `flight` in `src/common/test-util/src/flight.rs` to encode record batches to Flight data format.
- **Update `greptime-proto` dependency**: Changed the revision of the `greptime-proto` dependency in `Cargo.lock` and `Cargo.toml`.
- **Enhance WAL Writer and Tests**: Modified `src/mito2/src/wal.rs` and related test files to support bulk WAL entries and added tests for encoding and handling bulk data.
* feat/bulk-wal:
- **Update `greptime-proto` Dependency**: Updated the `greptime-proto` dependency to a new revision in `Cargo.lock` and `Cargo.toml`.
- **Add `common-grpc` Dependency**: Added `common-grpc` as a dependency in `Cargo.lock` and `src/mito2/Cargo.toml`.
- **Refactor `BulkPart` Structure**: Removed `num_rows` field and added `num_rows()` method in `src/mito2/src/memtable/bulk/part.rs`. Updated related usages in `src/mito2/src/memtable/simple_bulk_memtable.rs`, `src/mito2/src/memtable/time_partition.rs`, `src/mito2/src/memtable/time_series.rs`, `src/mito2/src/region_write_ctx.rs`, and `src/mito2/src/worker/handle_bulk_insert.rs`.
- **Implement `TryFrom` and `From` for `BulkWalEntry`**: Added implementations for converting between `BulkPart` and `BulkWalEntry` in `src/mito2/src/memtable/bulk/part.rs`.
- **Handle Bulk Entries in Region Opener**: Added logic to process bulk entries in `src/mito2/src/region/opener.rs`.
- **Fix `BulkInsertRequest` Handling**: Corrected `region_id` handling in `src/operator/src/bulk_insert.rs` and `src/store-api/src/region_request.rs`.
- **Add Error Variant for `ConvertBulkWalEntry`**: Added a new error variant in `src/mito2/src/error.rs` for handling bulk WAL entry conversion errors.
* fix: ci
* feat/bulk-wal:
Add bulk write operation in `opener.rs`
- Enhanced the region write context by adding a call to `write_bulk()` after `write_memtable()` in `opener.rs`.
- This change aims to improve the efficiency of writing operations by enabling bulk writes.
* feat/bulk-wal:
Enhance error handling and metrics in `bulk_insert.rs`
- Updated `Inserter` to improve error handling by capturing the result of `datanode.handle(request)` and incrementing the `DIST_INGEST_ROW_COUNT` metric with the number of affected rows.
* feat/bulk-wal:
### Remove Encode Error Handling for WAL Entries
- **`error.rs`**: Removed the `EncodeWal` error variant and its associated handling.
- **`wal.rs`**: Eliminated the `entry_encode_buf` buffer and its usage for encoding WAL entries. Replaced with direct encoding to a vector using `encode_to_vec()`.
* - **Refactor `RegionFilePathFactory` to `RegionFilePathProvider`:** Updated references and implementations in `access_layer.rs`, `write_cache.rs`, and related test files to use the new struct name.
- **Add `max_file_size` support in compaction:** Introduced `max_file_size` option in `PickerOutput`, `SerializedPickerOutput`, and `WriteOptions` in `compactor.rs`, `picker.rs`, `twcs.rs`, and `window.rs`.
- **Enhance Parquet writing logic:** Modified `parquet.rs` and `parquet/writer.rs` to support optional `max_file_size` and added a test case `test_write_multiple_files` to verify writing multiple files based on size constraints.
**Refactor Parquet Writer Initialization and File Handling**
- Updated `ParquetWriter` in `writer.rs` to handle `current_indexer` as an `Option`, allowing for more flexible initialization and management.
- Introduced `finish_current_file` method to encapsulate logic for completing and transitioning between SST files, improving code clarity and maintainability.
- Enhanced error handling and logging with `debug` statements for better traceability during file operations.
- **Removed Output Size Enforcement in `twcs.rs`:**
- Deleted the `enforce_max_output_size` function and related logic to simplify compaction input handling.
- **Added Max File Size Option in `parquet.rs`:**
- Introduced `max_file_size` in `WriteOptions` to control the maximum size of output files.
- **Refactored Indexer Management in `parquet/writer.rs`:**
- Changed `current_indexer` from an `Option` to a direct `Indexer` type.
- Implemented `roll_to_next_file` to handle file transitions when exceeding `max_file_size`.
- Simplified indexer initialization and management logic.
- **Refactored SST File Handling**:
- Introduced `FilePathProvider` trait and its implementations (`WriteCachePathProvider`, `RegionFilePathFactory`) to manage SST and index file paths.
- Updated `AccessLayer`, `WriteCache`, and `ParquetWriter` to use `FilePathProvider` for path management.
- Modified `SstWriteRequest` and `SstUploadRequest` to use path providers instead of direct paths.
- Files affected: `access_layer.rs`, `write_cache.rs`, `parquet.rs`, `writer.rs`.
- **Enhanced Indexer Management**:
- Replaced `IndexerBuilder` with `IndexerBuilderImpl` and made it async to support dynamic indexer creation.
- Updated `ParquetWriter` to handle multiple indexers and file IDs.
- Files affected: `index.rs`, `parquet.rs`, `writer.rs`.
- **Removed Redundant File ID Handling**:
- Removed `file_id` from `SstWriteRequest` and `CompactionOutput`.
- Updated related logic to dynamically generate file IDs where necessary.
- Files affected: `compaction.rs`, `flush.rs`, `picker.rs`, `twcs.rs`, `window.rs`.
- **Test Adjustments**:
- Updated tests to align with new path and indexer management.
- Introduced `FixedPathProvider` and `NoopIndexBuilder` for testing purposes.
- Files affected: `sst_util.rs`, `version_util.rs`, `parquet.rs`.
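The `max_file_size` rollover described in this commit can be sketched roughly as follows. `RollingWriter` is a hypothetical type that only tracks byte counts; the real `ParquetWriter` also manages indexers, file IDs and paths, and its exact threshold check may differ.

```rust
/// Hypothetical writer that only tracks byte counts per output file.
pub struct RollingWriter {
    max_file_size: Option<usize>,
    current_size: usize,
    finished: Vec<usize>,
}

impl RollingWriter {
    pub fn new(max_file_size: Option<usize>) -> Self {
        Self { max_file_size, current_size: 0, finished: Vec::new() }
    }

    /// Records one batch, then rolls to a new file if the limit is exceeded.
    pub fn write_batch(&mut self, batch_bytes: usize) {
        self.current_size += batch_bytes;
        if let Some(limit) = self.max_file_size {
            if self.current_size > limit {
                self.roll_to_next_file();
            }
        }
    }

    /// Closes the current file (if it holds any data) and starts an empty one.
    fn roll_to_next_file(&mut self) {
        if self.current_size > 0 {
            self.finished.push(self.current_size);
            self.current_size = 0;
        }
    }

    /// Closes the last file and returns the size of every file written.
    pub fn finish(mut self) -> Vec<usize> {
        self.roll_to_next_file();
        self.finished
    }
}
```

With `max_file_size` left as `None`, everything ends up in a single file, which matches the option being optional.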
* chore: rebase main
* feat/multiple-compaction-output:
### Add Benchmarking and Refactor Compaction Logic
- **Benchmarking**: Added a new benchmark `run_bench` in `Cargo.toml` and implemented benchmarks in `benches/run_bench.rs` using Criterion for `find_sorted_runs` and `reduce_runs` functions.
- **Compaction Module Enhancements**:
- Made `run.rs` public and refactored the `Ranged` and `Item` traits to be public.
- Simplified the logic in `find_sorted_runs` and `reduce_runs` by removing `MergeItems` and related functions.
- Introduced `find_overlapping_items` for identifying overlapping items.
- **Code Cleanup**: Removed redundant code and tests related to `MergeItems` in `run.rs`.
* feat/multiple-compaction-output:
### Enhance Compaction Logic and Add Benchmarks
- **Compaction Logic Improvements**:
- Updated `reduce_runs` function in `src/mito2/src/compaction/run.rs` to remove the target parameter and improve the logic for selecting files to merge based on minimum penalty.
- Enhanced `find_overlapping_items` to handle unsorted inputs and improve overlap detection efficiency.
- **Benchmark Enhancements**:
- Added `bench_find_overlapping_items` in `src/mito2/benches/run_bench.rs` to benchmark the new `find_overlapping_items` function.
- Extended existing benchmarks to include larger data sizes.
- **Testing Enhancements**:
- Updated tests in `src/mito2/src/compaction/run.rs` to reflect changes in `reduce_runs` and added new tests for `find_overlapping_items`.
- **Logging and Debugging**:
- Improved logging in `src/mito2/src/compaction/twcs.rs` to provide more detailed information about compaction decisions.
* feat/multiple-compaction-output:
### Refactor and Enhance Compaction Logic
- **Refactor `find_overlapping_items` Function**: Changed the function signature to accept slices instead of mutable vectors in `run.rs`.
- **Rename and Update Struct Fields**: Renamed `penalty` to `size` in `SortedRun` struct and updated related logic in `run.rs`.
- **Enhance `reduce_runs` Function**: Improved logic to sort runs by size and limit probe runs to 100 in `run.rs`.
- **Add `merge_seq_files` Function**: Introduced a new function `merge_seq_files` in `run.rs` for merging sequential files.
- **Modify `TwcsPicker` Logic**: Updated the compaction logic to use `merge_seq_files` when only one run is found in `twcs.rs`.
- **Remove `enforce_file_num` Function**: Deleted the `enforce_file_num` function and its related test cases in `twcs.rs`.
* feat/multiple-compaction-output:
### Enhance Compaction Logic and Testing
- **Add `merge_seq_files` Functionality**: Implemented the `merge_seq_files` function in `run.rs` to optimize file merging based on scoring systems. Updated benchmarks in `run_bench.rs` to include `bench_merge_seq_files`.
- **Improve Compaction Strategy in `twcs.rs`**: Modified the compaction logic to handle file merging more effectively, considering file size and overlap.
- **Update Tests**: Enhanced test coverage in `compaction_test.rs` and `append_mode_test.rs` to validate new compaction logic and file merging strategies.
- **Remove Unused Function**: Deleted `new_file_handles` from `test_util.rs` as it was no longer needed.
* feat/multiple-compaction-output:
### Refactor TWCS Compaction Options
- **Refactor Compaction Logic**: Simplified the TWCS compaction logic by replacing multiple parameters (`max_active_window_runs`, `max_active_window_files`, `max_inactive_window_runs`, `max_inactive_window_files`) with a single `trigger_file_num` parameter in `picker.rs`, `twcs.rs`, and `options.rs`.
- **Update Tests**: Adjusted test cases to reflect the new compaction logic in `append_mode_test.rs`, `compaction_test.rs`, `filter_deleted_test.rs`, `merge_mode_test.rs`, and various test files under `tests/cases`.
- **Modify Engine Options**: Updated engine option keys to use `trigger_file_num` in `mito_engine_options.rs` and `region_request.rs`.
- **Fuzz Testing**: Updated fuzz test generators and translators to accommodate the new compaction parameter in `alter_expr.rs` and related files.
This refactor aims to streamline the compaction configuration by reducing the number of parameters and simplifying the codebase.
* chore: add trailing space
* fix license header
* feat/revise-compaction-picker:
**Limit File Processing and Optimize Merge Logic in `run.rs`**
- Introduced a limit to process a maximum of 100 files in `merge_seq_files` to control time complexity.
- Adjusted logic to calculate `target_size` and iterate over files using the limited set of files.
- Updated scoring calculations to use the limited file set, ensuring efficient file merging.
* feat/revise-compaction-picker:
### Add Compaction Metrics and Remove Debug Logging
- **Compaction Metrics**: Introduced new histograms `COMPACTION_INPUT_BYTES` and `COMPACTION_OUTPUT_BYTES` to track compaction input and output file sizes in `metrics.rs`. Updated `compactor.rs` to observe these metrics during the compaction process.
- **Logging Cleanup**: Removed debug logging of file ranges during the merge process in `twcs.rs`.
* feat/revise-compaction-picker:
## Enhance Compaction Logic and Metrics
- **Compaction Logic Improvements**:
- Added methods `input_file_size` and `output_file_size` to `MergeOutput` in `compactor.rs` to streamline file size calculations.
- Updated `Compactor` implementation to use these methods for metrics tracking.
- Modified `Ranged` trait logic in `run.rs` to improve range comparison.
- Enhanced test cases in `run.rs` to reflect changes in compaction logic.
- **Metrics Enhancements**:
- Changed `COMPACTION_INPUT_BYTES` and `COMPACTION_OUTPUT_BYTES` from histograms to counters in `metrics.rs` for better performance tracking.
- **Debugging and Logging**:
- Added detailed logging for compaction pick results in `twcs.rs`.
- Implemented custom `Debug` trait for `FileMeta` in `file.rs` to improve debugging output.
- **Testing Enhancements**:
- Added new test `test_compaction_overlapping_files` in `compaction_test.rs` to verify compaction behavior with overlapping files.
- Updated `merge_mode_test.rs` to reflect changes in file handling during scans.
* feat/revise-compaction-picker:
### Update `FileHandle` Debug Implementation
- **Refactor Debug Output**: Simplified the `fmt::Debug` implementation for `FileHandle` in `src/mito2/src/sst/file.rs` by consolidating multiple fields into a single `meta` field using `meta_ref()`.
- **Atomic Operations**: Updated the `deleted` field to use atomic loading with `Ordering::Relaxed`.
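A minimal sketch of such a `Debug` implementation, assuming a trimmed-down `FileMeta` and a `FileHandle` that holds its fields directly (the real struct layout may differ):

```rust
use std::fmt;
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical, trimmed-down metadata; the real `FileMeta` has more fields.
#[derive(Debug)]
pub struct FileMeta {
    pub file_size: u64,
}

/// Hypothetical handle holding metadata and a deletion flag.
pub struct FileHandle {
    meta: FileMeta,
    deleted: AtomicBool,
}

impl FileHandle {
    pub fn new(meta: FileMeta, deleted: bool) -> Self {
        Self { meta, deleted: AtomicBool::new(deleted) }
    }

    pub fn meta_ref(&self) -> &FileMeta {
        &self.meta
    }
}

impl fmt::Debug for FileHandle {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_struct("FileHandle")
            // One `meta` field instead of listing every metadata field.
            .field("meta", self.meta_ref())
            // A relaxed load is enough for a diagnostic snapshot.
            .field("deleted", &self.deleted.load(Ordering::Relaxed))
            .finish()
    }
}
```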
* Trigger CI
* feat/revise-compaction-picker:
**Update compaction logic and default options**
- **`twcs.rs`**: Enhanced logging for compaction pick results by improving the formatting for better readability.
- **`options.rs`**: Modified the default `max_output_file_size` in `TwcsOptions` from 2GB to 512MB to optimize file handling and performance.
* feat/revise-compaction-picker:
Refactor `find_overlapping_items` to use an external result vector
- Updated `find_overlapping_items` in `src/mito2/src/compaction/run.rs` to accept a mutable result vector instead of returning a new vector, improving memory efficiency.
- Modified benchmarks in `src/mito2/benches/bench_compaction_picker.rs` to accommodate the new function signature.
- Adjusted tests in `src/mito2/src/compaction/run.rs` to use the updated function signature, ensuring correct functionality with the new approach.
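The out-parameter style can be sketched as below. This is a simplified, quadratic stand-in: `Range` replaces items implementing `Ranged`, and the overlap rule (inclusive bounds, collecting items from `right` only) is an assumption, not the actual algorithm in `run.rs`.

```rust
/// Inclusive range `(start, end)`; a stand-in for an item implementing `Ranged`.
pub type Range = (i64, i64);

/// Two inclusive ranges overlap if each starts before the other ends.
fn overlaps(a: &Range, b: &Range) -> bool {
    a.0 <= b.1 && b.0 <= a.1
}

/// Appends to `result` every item of `right` that overlaps some item of `left`.
/// The caller owns `result`, so one buffer can be cleared and reused across calls.
pub fn find_overlapping_items(left: &[Range], right: &[Range], result: &mut Vec<Range>) {
    for r in right {
        if left.iter().any(|l| overlaps(l, r)) {
            result.push(*r);
        }
    }
}
```

Reusing one `result` vector across calls avoids a fresh allocation per probe, which is where the memory saving comes from.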
* feat/revise-compaction-picker:
Improve file merging logic in `run.rs`
- Refactor the loop logic in `merge_seq_files` to simplify the iteration over file groups.
- Adjust the range for `end_idx` to include the endpoint, allowing for more flexible group selection.
- Remove the condition that skips groups with only one file, enabling more comprehensive processing of file sequences.
* feat/revise-compaction-picker:
Enhance `find_overlapping_items` with `SortedRun` and Update Tests
- Refactor `find_overlapping_items` in `src/mito2/src/compaction/run.rs` to utilize the `SortedRun` struct for improved efficiency and clarity.
- Introduce a `sorted` flag in `SortedRun` to optimize sorting operations.
- Update test cases in `src/mito2/benches/bench_compaction_picker.rs` to accommodate changes in `find_overlapping_items` by using `SortedRun`.
- Add `From<Vec<T>>` implementation for `SortedRun` to facilitate easy conversion from vectors.
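A rough sketch of the `sorted` flag together with the `From<Vec<T>>` conversion. The fields and helper methods here are hypothetical, and the real `SortedRun` orders items by their ranges rather than by `Ord`.

```rust
/// Hypothetical simplified run: items plus a flag recording whether they are sorted.
pub struct SortedRun<T> {
    items: Vec<T>,
    sorted: bool,
}

impl<T> From<Vec<T>> for SortedRun<T> {
    /// A plain vector carries no ordering guarantee, so start as unsorted.
    fn from(items: Vec<T>) -> Self {
        Self { items, sorted: false }
    }
}

impl<T: Ord> SortedRun<T> {
    /// Sorts only if the run is not already known to be sorted.
    pub fn sort_items(&mut self) {
        if !self.sorted {
            self.items.sort();
            self.sorted = true;
        }
    }

    pub fn items(&self) -> &[T] {
        &self.items
    }

    pub fn sorted(&self) -> bool {
        self.sorted
    }
}
```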
* feat/revise-compaction-picker:
**Enhancements in `compaction/run.rs`:**
- Added `ReadableSize` import to handle size calculations.
- Modified the logic in `merge_seq_files` to clamp the calculated target size to a maximum of 2GB when `max_file_size` is not provided.
* feat/revise-compaction-picker: Add Default Max Output Size Constant for Compaction
Introduce the `DEFAULT_MAX_OUTPUT_SIZE` constant to define the default maximum compaction output file size as 2GB. Refactor the `merge_seq_files` function to use this constant, so that file size limits are handled consistently during compaction.
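A minimal sketch of the clamping, under two assumptions: 2GB means 2 GiB, and a caller-supplied `max_file_size` is used directly as the target. The `target_size` helper is hypothetical, and how `calculated` is derived inside `merge_seq_files` is not shown.

```rust
/// Default cap on a compaction output file: 2 GiB (assumed binary gigabytes).
pub const DEFAULT_MAX_OUTPUT_SIZE: u64 = 2 * 1024 * 1024 * 1024;

/// Picks the target output size. `calculated` is whatever size the merge
/// heuristic proposes; how it is derived is outside this sketch.
pub fn target_size(calculated: u64, max_file_size: Option<u64>) -> u64 {
    match max_file_size {
        // Assumption: an explicit limit is used as the target.
        Some(limit) => limit,
        // No explicit limit: clamp the calculated size to the default cap.
        None => calculated.min(DEFAULT_MAX_OUTPUT_SIZE),
    }
}
```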
* add benchmark for splitting according to time partition
* feat/write-to-multiple-time-partitions:
**Enhancements to Bulk Processing and Time Partitioning**
- **`part.rs`**: Added `Snafu` to imports and introduced `timestamp_index` in `BulkPart` struct. Implemented `timestamps` method for accessing timestamp columns.
- **`simple_bulk_memtable.rs`**: Updated tests to include `timestamp_index` initialization.
- **`time_partition.rs`**: Enhanced `TimePartition` to support partial writes with `write_record_batch_partial`. Implemented `split_record_batch` for filtering records by timestamp range. Added comprehensive tests for `split_record_batch`.
- **`handle_bulk_insert.rs`**: Modified to retrieve timestamp index and column together, updating `BulkPart` initialization with `timestamp_index`.
* feat/write-to-multiple-time-partitions:
### Enhance Time Partitioning Logic
- **`time_partition.rs`**:
- Introduced `HashSet` for efficient partition management.
- Refactored `write_bulk` to handle multiple partitions and added `find_partitions_by_time_range` for identifying existing and missing partitions.
- Updated `get_or_create_time_partition` to manage partition creation.
- Added comprehensive tests for partition finding logic, covering various scenarios including overlapping and non-overlapping time ranges.
- **Tests**:
- Added `test_find_partitions_by_time_range` to validate new partitioning logic.
- Updated `test_split_record_batch` to ensure correct record batch splitting behavior.
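The split into existing and missing partitions can be sketched as follows. This is a simplified stand-in that identifies partitions by their start timestamp in seconds; the parameter names and return shape are assumptions, not the real `find_partitions_by_time_range` signature.

```rust
use std::collections::HashSet;

/// Splits the buckets covered by `[min_ts, max_ts]` into those that already
/// have a partition and those that still need one. Partitions are identified
/// by their start timestamp; all values are in seconds.
pub fn find_partitions_by_time_range(
    existing_starts: &HashSet<i64>,
    part_duration: i64,
    min_ts: i64,
    max_ts: i64,
) -> (Vec<i64>, Vec<i64>) {
    // Floor division so that pre-epoch timestamps map to the right bucket.
    let start_bucket = min_ts.div_euclid(part_duration);
    let end_bucket = max_ts.div_euclid(part_duration);
    let mut matching_parts = Vec::new();
    let mut missing_parts = Vec::new();
    for bucket in start_bucket..=end_bucket {
        let start = bucket * part_duration;
        if existing_starts.contains(&start) {
            matching_parts.push(start);
        } else {
            missing_parts.push(start);
        }
    }
    (matching_parts, missing_parts)
}
```

The caller can then write to the matching partitions and create the missing ones before writing the remaining rows.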
* feat/write-to-multiple-time-partitions:
### Enhance Time Partitioning and Testing in `time_partition.rs`
- **Time Partitioning Enhancements**:
- Updated `split_record_batch` to handle multiple timestamp units (`Second`, `Millisecond`, `Microsecond`, `Nanosecond`) by matching on `DataType`.
- Improved filtering logic for timestamp arrays to support various time units.
- **Testing Enhancements**:
- Added `test_write_bulk` to verify writing across multiple partitions and scenarios in `time_partition.rs`.
- Updated `test_split_record_batch` to use `TimestampMillisecondArray` for testing timestamp partitioning.
- **Imports and Dependencies**:
- Added necessary imports for new timestamp array types and testing utilities.
* feat/write-to-multiple-time-partitions:
### Refactor and Enhance Time Partition Filtering
- **Refactor Filtering Logic**: Consolidated the filtering logic for timestamp arrays using macros in `time_partition.rs` and `bench_filter_time_partition.rs`. This reduces code duplication and improves maintainability.
- **Enhance `BulkPart` Struct**: Made fields in `BulkPart` public to facilitate easier access and manipulation in `memtable.rs` and `part.rs`.
- **Rename Function**: Renamed `split_record_batch` to `filter_record_batch` for clarity in `time_partition.rs` and `bench_filter_time_partition.rs`.
- **Add Feature Flag**: Introduced `int_roundings` feature in `lib.rs` to support new functionality.
* refactor tests
* feat/write-to-multiple-time-partitions:
Improve timestamp handling in `time_partition.rs`
- Enhanced safety comments for timestamp conversion to ensure clarity.
- Modified logic to prevent overflow by using `div_euclid` for `bulk_start_sec` and `bulk_end_sec` calculations.
- Adjusted the `filter_map` logic to correctly compute timestamps using `start_sec` and `part_duration_sec`.
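For illustration, a small sketch of bucket arithmetic with `div_euclid`. With a positive divisor it rounds toward negative infinity, so timestamps before the epoch land in the bucket that actually contains them, whereas plain `/` truncates toward zero. The helper names are hypothetical.

```rust
/// Bucket index for `ts_sec`, given a positive partition duration in seconds.
pub fn bucket_of(ts_sec: i64, part_duration_sec: i64) -> i64 {
    ts_sec.div_euclid(part_duration_sec)
}

/// Start timestamp (seconds) of the bucket that contains `ts_sec`.
pub fn bucket_start(ts_sec: i64, part_duration_sec: i64) -> i64 {
    bucket_of(ts_sec, part_duration_sec) * part_duration_sec
}
```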
* feat/write-to-multiple-time-partitions:
**Refactor timestamp handling and add utility function**
- **Refactor `time_partition.rs`:** Simplified timestamp handling by replacing direct type access with a utility function to retrieve the timestamp unit. Improved error handling for timestamp conversion.
- **Enhance `metadata.rs`:** Added `time_index_type` function to `RegionMetadata` to retrieve the timestamp type of the time index column, ensuring safer and more readable code.
* feat/write-to-multiple-time-partitions:
Refactor time partition variable names in `time_partition.rs`
- Renamed variables for clarity: `bulk_start_sec` to `start_bucket` and `bulk_end_sec` to `end_bucket`.
- Updated related logic to use new variable names for improved readability and maintainability.
* feat/write-to-multiple-time-partitions:
**Refactor variable names in `time_partition.rs`**
- Updated variable names from `matching` and `missing` to `matchings` and `missings` for clarity and consistency.
- Modified function calls and loop iterations to align with the new variable names.
- Affected file: `src/mito2/src/memtable/time_partition.rs`
* feat/write-to-multiple-time-partitions:
### Refactor variable names in `time_partition.rs`
- Updated variable names for clarity in `time_partition.rs`:
- Renamed `matchings` to `matching_parts`
- Renamed `missings` to `missing_parts`
- Adjusted logic to use new variable names in methods `find_partitions_by_time_range` and `write_record_batch`.
* feat/write-to-multiple-time-partitions:
### Enhance Time Partition Handling
- **`time_partition.rs`**:
- Added `ArrayRef` to handle timestamp arrays, improving the partitioning logic by allowing more efficient timestamp range checks.
- Enhanced `find_partitions_by_time_range` to support sparse data and handle different timestamp units (`Second`, `Millisecond`, `Microsecond`, `Nanosecond`).
- Updated test cases to cover new scenarios, including sparse data and edge cases, ensuring robustness of partition handling.
---------
Co-authored-by: Lei <lei@Leis-MacBook-Pro.local>
* feat: introduce `PrimaryKeyEncoding`
* fix: fix unit tests
* chore: add empty line
* test: add unit tests
* chore: fmt code
* refactor: introduce new codec trait to support various encoding
* fix: fix unit tests
* chore: update sqlness result
* chore: apply suggestions from CR
* chore: apply suggestions from CR
* feat: add update_mode to region options
* test: add test
* feat: last not null iter
* feat: time series last not null
* feat: partition tree update mode
* feat: partition tree
* fix: last not null iter slice
* test: add test for compaction
* test: use second resolution
* style: fix clippy
* chore: merge two lines
Co-authored-by: Jeremyhi <jiachun_feng@proton.me>
* chore: address CR comments
* refactor: UpdateMode -> MergeMode
* refactor: LastNotNull -> LastNonNull
* chore: return None earlier
* feat: validate region options
Make merge mode optional and use the default when it is None.
* test: fix tests
---------
Co-authored-by: Jeremyhi <jiachun_feng@proton.me>