* feat: struct value
Signed-off-by: Ning Sun <sunning@greptime.com>
* feat: update for proto module
* feat: wip struct type
* feat: implement more vector operations
* feat: make datatype and api
* feat: resolve some compilation issues
* feat: resolve all compilation issues
* chore: format update
* test: resolve tests
* test: test and refactor value-to-pb
* feat: add more tests and fix for value types
* chore: remove dbg
* feat: test and fix iterator
* fix: resolve struct_type issue
* refactor: use vec for struct items
* chore: update proto to main branch
* refactor: address some of review issues
* refactor: update for further review
* Add validation on new methods
* feat: update struct/list json serialization
* refactor: reimplement get in struct_vector
* refactor: struct vector functions
* refactor: fix lint issue
* refactor: address review comments
---------
Signed-off-by: Ning Sun <sunning@greptime.com>
* feat: use correct projection index for old format
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: remove allow dead_code from format
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: check and convert old format to flat format
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: subtract primary key num from projection
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: always convert the batch in FlatRowGroupReader
Signed-off-by: evenyag <realevenyag@gmail.com>
* style: fix clippy
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: Change &Option<&[]> to Option<&[]>
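The signature change can be illustrated with a minimal sketch. The helper names `projected_len_old` and `projected_len` are hypothetical and are not taken from the codebase; they only show why `Option<&[T]>` is easier to call than `&Option<&[T]>`.

```rust
// Hypothetical helper using the old style: callers must borrow an Option.
fn projected_len_old(projection: &Option<&[usize]>, total: usize) -> usize {
    match projection {
        Some(p) => p.len(),
        None => total,
    }
}

// Hypothetical helper using the new style: `Option<&[T]>` is `Copy`,
// so it is passed by value.
fn projected_len(projection: Option<&[usize]>, total: usize) -> usize {
    projection.map_or(total, |p| p.len())
}

fn main() {
    let owned: Option<Vec<usize>> = Some(vec![0, 2, 5]);

    // Old style: build an `Option<&[usize]>` first, then borrow it.
    let borrowed: Option<&[usize]> = owned.as_deref();
    let old = projected_len_old(&borrowed, 8);

    // New style: `as_deref()` yields the argument directly.
    let new = projected_len(owned.as_deref(), 8);

    assert_eq!(old, new);
    println!("old = {old}, new = {new}");
}
```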
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: only build arrow schema once
Adds a method flat_sst_arrow_schema_column_num() to get the number of fields.
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: Handle flat format and old format separately
Adds two structs ParquetFlat and ParquetPrimaryKeyToFlat.
ParquetPrimaryKeyToFlat delegates stats and projection to the
PrimaryKeyReadFormat.
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: handle non string tag correctly
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: do not register file cache twice
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: clean temp files
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: add rows and bytes to flush success log
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: convert format in memtable
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: add compaction flag to ScanInput
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: compaction should use old format for sparse encoding
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: merge schema uses old format in sparse encoding
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: read legacy format without converting if skip_auto_convert is set
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: support sparse encoding in bulk parts
Signed-off-by: evenyag <realevenyag@gmail.com>
---------
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: add datafusion-postgres dependency
* refactor: move and include pg_catalog udfs
* chore: update upstream
* feat: register table function pg_get_keywords
* feat: bridge CatalogInfo for our CatalogManager
Signed-off-by: Ning Sun <sunning@greptime.com>
* feat: convert pg_catalog table to our system table
* feat: bridge system catalog with datafusion-postgres
Signed-off-by: Ning Sun <sunning@greptime.com>
* feat: add more udfs
* feat: add compatibility rewriter to postgres handler
* fix: various fixes
* fmt: fix
* fix: use functions from pg_catalog library
* fmt
* fix: sqlness runner
Signed-off-by: Ning Sun <sunning@greptime.com>
* test: adapt to arrow 56.0 to 56.1 memory size change
* fix: add additional udfs
* chore: format
* refactor: return None when creating system table failed
Signed-off-by: Ning Sun <sunning@greptime.com>
* chore: provide safety comments about expect usage
---------
Signed-off-by: Ning Sun <sunning@greptime.com>
fix/disable-parquet-stats-truncate:
- **Update `memcomparable` Dependency**: Switched from crates.io to a Git repository for `memcomparable` in `Cargo.lock`, `mito-codec/Cargo.toml`, and removed it from `mito2/Cargo.toml`.
- **Enhance Parquet Writer Properties**: Added `set_statistics_truncate_length` and `set_column_index_truncate_length` to `WriterProperties` in `parquet.rs`, `bulk/part.rs`, `partition_tree/data.rs`, and `writer.rs`.
- **Add Test for Corrupt Scan**: Introduced a new test module `scan_corrupt.rs` in `mito2/src/engine` to verify handling of corrupt data.
- **Update Test Data**: Modified test data in `flush.rs` to reflect changes in file sizes and sequences.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* chore/update-sequence-on-region-edit:
Refactor `get_last_seq_num` Method Across Engines
- **Change Return Type**: Updated the `get_last_seq_num` method to return `Result<SequenceNumber, BoxedError>` instead of `Result<Option<SequenceNumber>, BoxedError>` in the following files:
  - `src/datanode/src/tests.rs`
  - `src/file-engine/src/engine.rs`
  - `src/metric-engine/src/engine.rs`
  - `src/metric-engine/src/engine/read.rs`
  - `src/mito2/src/engine.rs`
  - `src/query/src/optimizer/test_util.rs`
  - `src/store-api/src/region_engine.rs`
- **Enhance Region Edit Handling**: Modified `RegionWorkerLoop` in `src/mito2/src/worker/handle_manifest.rs` to update file sequences during region edits.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* add committed_sequence to RegionEdit
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* chore/update-sequence-on-region-edit:
Refactor sequence retrieval method
- **Renamed Method**: Changed `get_last_seq_num` to `get_committed_sequence` across multiple files to better reflect its purpose of retrieving the latest committed sequence.
  - Affected files: `tests.rs`, `engine.rs` in `file-engine`, `metric-engine`, `mito2`, `test_util.rs`, and `region_engine.rs`.
- **Removed Unused Struct**: Deleted `RegionSequencesRequest` struct from `region_request.rs` as it is no longer needed.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* chore/update-sequence-on-region-edit:
**Add Committed Sequence Handling in Region Engine**
- **`engine.rs`**: Introduced a new test module `bump_committed_sequence_test` to verify committed sequence handling.
- **`bump_committed_sequence_test.rs`**: Added a test to ensure the committed sequence is correctly updated and persisted across region reopenings.
- **`action.rs`**: Updated `RegionManifest` and `RegionManifestBuilder` to include `committed_sequence` for tracking.
- **`manager.rs`**: Adjusted manifest size assertion to accommodate new committed sequence data.
- **`opener.rs`**: Implemented logic to override committed sequence during region opening.
- **`version.rs`**: Added `set_committed_sequence` method to update the committed sequence in `VersionControl`.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* chore/update-sequence-on-region-edit:
**Enhance `test_bump_committed_sequence` in `bump_committed_sequence_test.rs`**
- Updated the test to include row operations using `build_rows`, `put_rows`, and `rows_schema` to verify the committed sequence behavior.
- Adjusted assertions to reflect changes in committed sequence after row operations and region edits.
- Added comments to clarify the expected behavior of committed sequence after reopening the region and replaying the WAL.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* chore/update-sequence-on-region-edit:
**Enhance Region Sequence Management**
- **`bump_committed_sequence_test.rs`**: Updated test to handle region reopening and sequence management, ensuring committed sequences are correctly set and verified after edits.
- **`opener.rs`**: Improved committed sequence handling by overriding it only if the manifest's sequence is greater than the replayed sequence. Added logging for mutation sequence replay.
- **`region_write_ctx.rs`**: Modified `push_mutation` and `push_bulk` methods to adopt sequence numbers from parameters, enhancing sequence management during write operations.
- **`handle_write.rs`**: Updated `RegionWorkerLoop` to pass sequence numbers in `push_bulk` and `push_mutation` methods, ensuring consistent sequence handling.
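
The override rule described for `opener.rs` can be sketched as follows. This is a simplified illustration and not the actual implementation: `resolve_committed_sequence` is a hypothetical helper, and `SequenceNumber` is assumed to be a plain `u64` alias here.

```rust
// Assumption: the sequence number is modeled as a plain u64 in this sketch.
type SequenceNumber = u64;

/// Hypothetical helper: picks the committed sequence after opening a region.
/// The manifest value overrides the replayed one only when it is greater.
fn resolve_committed_sequence(
    manifest_seq: SequenceNumber,
    replayed_seq: SequenceNumber,
) -> SequenceNumber {
    if manifest_seq > replayed_seq {
        manifest_seq
    } else {
        replayed_seq
    }
}

fn main() {
    // A region edit bumped the manifest sequence past what the WAL replayed.
    assert_eq!(resolve_committed_sequence(10, 7), 10);
    // WAL replay went further than the manifest: keep the replayed value.
    assert_eq!(resolve_committed_sequence(5, 9), 9);
    println!("ok");
}
```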
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* chore/update-sequence-on-region-edit:
### Remove Debug Logging from `opener.rs`
- Removed debug logging for mutation sequences in `opener.rs` to clean up the output and improve performance.
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
---------
Signed-off-by: Lei, HUANG <mrsatangel@gmail.com>
* feat: support flat format in SeqScan
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: support flat format in unordered scan
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: support parallel read for flat format in SeqScan
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: rename flat DedupReader to FlatDedupReader
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: address review comments
It also precomputes the input arrow schema
Signed-off-by: evenyag <realevenyag@gmail.com>
---------
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: implements method to write flat batch for ParquetWriter
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: add update method for flat RecordBatch in Indexer
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: calls indexer to write flat batch in ParquetWriter
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: handle empty projection for flat format
Signed-off-by: evenyag <realevenyag@gmail.com>
* fix: eval array in precise_filter_flat
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: cache column lookup result in inverted indexer
Signed-off-by: evenyag <realevenyag@gmail.com>
* test: add test
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: support dict type in dense codec
Signed-off-by: evenyag <realevenyag@gmail.com>
* test: remove read part in test as it needs modifying the reader
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: support dictionary type in other methods for dense codec
Signed-off-by: evenyag <realevenyag@gmail.com>
* refactor: fulltext use string array directly
Signed-off-by: evenyag <realevenyag@gmail.com>
---------
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: cache the cloned page bytes
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: cache the whole row group pages
The opendal reader may merge IO requests so the pages of different
columns can share the same Bytes.
When we use a per-column page cache, the page cache may still reference
the whole Bytes after eviction if there are other columns in the cache that
share the same Bytes.
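
The retention problem can be reproduced with the standard library alone. In this sketch `Arc<Vec<u8>>` stands in for a shared `Bytes` buffer, and `PageView` and `merged_buffer_alive_after_evicting_one_page` are hypothetical names used only for illustration:

```rust
use std::ops::Range;
use std::sync::Arc;

/// Stand-in for a page that is a view into a larger merged read buffer.
struct PageView {
    buf: Arc<Vec<u8>>,
    range: Range<usize>,
}

impl PageView {
    fn data(&self) -> &[u8] {
        &self.buf[self.range.clone()]
    }

    /// Copies the page into its own allocation, dropping the link to `buf`.
    fn to_owned_page(&self) -> Vec<u8> {
        self.data().to_vec()
    }
}

/// Returns whether the merged buffer is still allocated after one of two
/// cached pages has been evicted.
fn merged_buffer_alive_after_evicting_one_page(copy_pages: bool) -> bool {
    let merged = Arc::new(vec![7u8; 1024]);
    let watcher = Arc::downgrade(&merged);
    let views = vec![
        PageView { buf: Arc::clone(&merged), range: 0..100 },
        PageView { buf: Arc::clone(&merged), range: 500..600 },
    ];
    drop(merged);

    if copy_pages {
        // Cache owns copies; the views (and the merged buffer) can go away.
        let mut cache: Vec<Vec<u8>> = views.iter().map(PageView::to_owned_page).collect();
        drop(views);
        cache.remove(0); // evict the first page
        let alive = watcher.upgrade().is_some();
        alive
    } else {
        // Cache holds views; the remaining view pins the whole buffer.
        let mut cache = views;
        cache.remove(0); // evict the first page
        let alive = watcher.upgrade().is_some();
        alive
    }
}

fn main() {
    assert!(merged_buffer_alive_after_evicting_one_page(false));
    assert!(!merged_buffer_alive_after_evicting_one_page(true));
    println!("ok");
}
```

Without copying, evicting one page frees nothing because the other 100-byte view still pins the full 1024-byte buffer.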
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: check possible max byte range and copy pages if needed
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: always copy pages
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: returns the copied pages
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: compute cache size by MERGE_GAP
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: align to buf size
Signed-off-by: evenyag <realevenyag@gmail.com>
* feat: align to 2MB
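A minimal sketch of aligning a size to a 2 MiB boundary. `ALIGN_2MB` and `align_up` are hypothetical names, and rounding up (not down) is an assumption of this sketch:

```rust
/// Assumed alignment unit: 2 MiB.
const ALIGN_2MB: u64 = 2 * 1024 * 1024;

/// Rounds `size` up to the next multiple of `align` (`align` must be non-zero).
fn align_up(size: u64, align: u64) -> u64 {
    (size + align - 1) / align * align
}

fn main() {
    assert_eq!(align_up(0, ALIGN_2MB), 0);
    assert_eq!(align_up(1, ALIGN_2MB), ALIGN_2MB);
    assert_eq!(align_up(ALIGN_2MB, ALIGN_2MB), ALIGN_2MB);
    assert_eq!(align_up(ALIGN_2MB + 1, ALIGN_2MB), 2 * ALIGN_2MB);
    println!("ok");
}
```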
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: remove unused code
Signed-off-by: evenyag <realevenyag@gmail.com>
* style: fix clippy
Signed-off-by: evenyag <realevenyag@gmail.com>
* chore: fix typo
Signed-off-by: evenyag <realevenyag@gmail.com>
* test: fix parquet read with cache test
Signed-off-by: evenyag <realevenyag@gmail.com>
---------
Signed-off-by: evenyag <realevenyag@gmail.com>