update CHANGELOG

Fix format (#2852 )
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2026-03-18 11:10:43 +00:00 · 2026-03-18 10:41:00 +01:00 · 2026-03-16 10:43:39 +01:00 · 2026-03-12 16:47:11 +01:00 · 2026-03-12 16:44:38 +01:00 · 2026-02-20 16:56:56 +01:00
46 changed files with 1407 additions and 210 deletions
--- a/.claude/skills/rationalize-deps/SKILL.md
+++ b/.claude/skills/rationalize-deps/SKILL.md
@@ -0,0 +1,125 @@
+---
+name: rationalize-deps
+description: Analyze Cargo.toml dependencies and attempt to remove unused features to reduce compile times and binary size
+---
+
+# Rationalize Dependencies
+
+This skill analyzes Cargo.toml dependencies to identify and remove unused features.
+
+## Overview
+
+Many crates enable features by default that may not be needed. This skill:
+1. Identifies dependencies with default features enabled
+2. Tests if `default-features = false` works
+3. Identifies which specific features are actually needed
+4. Verifies compilation after changes
+
+## Step 1: Identify the target
+
+Ask the user which crate(s) to analyze:
+- A specific crate name (e.g., "tokio", "serde")
+- A specific workspace member (e.g., "quickwit-search")
+- "all" to scan the entire workspace
+
+## Step 2: Analyze current dependencies
+
+For the workspace Cargo.toml (`quickwit/Cargo.toml`), list dependencies that:
+- Do NOT have `default-features = false`
+- Have default features that might be unnecessary
+
+Run: `cargo tree -p <crate> -f "{p} {f}" --edges features` to see what features are actually used.
+
+## Step 3: For each candidate dependency
+
+### 3a: Check the crate's default features
+
+Look up the crate on crates.io or check its Cargo.toml to understand:
+- What features are enabled by default
+- What each feature provides
+
+Use: `cargo metadata --format-version=1 | jq '.packages[] | select(.name == "<crate>") | .features'`
+
+### 3b: Try disabling default features
+
+Modify the dependency in `quickwit/Cargo.toml`:
+
+From:
+```toml
+some-crate = { version = "1.0" }
+```
+
+To:
+```toml
+some-crate = { version = "1.0", default-features = false }
+```
+
+### 3c: Run cargo check
+
+Run: `cargo check --workspace` (or target specific packages for faster feedback)
+
+If compilation fails:
+1. Read the error messages to identify which features are needed
+2. Add only the required features explicitly:
+   ```toml
+   some-crate = { version = "1.0", default-features = false, features = ["needed-feature"] }
+   ```
+3. Re-run cargo check
+
+### 3d: Binary search for minimal features
+
+If there are many default features, use binary search:
+1. Start with no features
+2. If it fails, add half the default features
+3. Continue until you find the minimal set
+
+## Step 4: Document findings
+
+For each dependency analyzed, report:
+- Original configuration
+- New configuration (if changed)
+- Features that were removed
+- Any features that are required
+
+## Step 5: Verify full build
+
+After all changes, run:
+```bash
+cargo check --workspace --all-targets
+cargo test --workspace --no-run
+```
+
+## Common Patterns
+
+### Serde
+Often only needs `derive`:
+```toml
+serde = { version = "1.0", default-features = false, features = ["derive", "std"] }
+```
+
+### Tokio
+Identify which runtime features are actually used:
+```toml
+tokio = { version = "1.0", default-features = false, features = ["rt-multi-thread", "macros", "sync"] }
+```
+
+### Reqwest
+Often doesn't need all TLS backends:
+```toml
+reqwest = { version = "0.11", default-features = false, features = ["rustls-tls", "json"] }
+```
+
+## Rollback
+
+If changes cause issues:
+```bash
+git checkout quickwit/Cargo.toml
+cargo check --workspace
+```
+
+## Tips
+
+- Start with large crates that have many default features (tokio, reqwest, hyper)
+- Use `cargo bloat --crates` to identify large dependencies
+- Check `cargo tree -d` for duplicate dependencies that might indicate feature conflicts
+- Some features are needed only for tests - consider using `[dev-dependencies]` features
--- a/.claude/skills/simple-pr/SKILL.md
+++ b/.claude/skills/simple-pr/SKILL.md
@@ -0,0 +1,60 @@
+---
+name: simple-pr
+description: Create a simple PR from staged changes with an auto-generated commit message
+disable-model-invocation: true
+---
+
+# Simple PR
+
+Follow these steps to create a simple PR from staged changes:
+
+## Step 1: Check workspace state
+
+Run: `git status`
+
+Verify that all changes have been staged (no unstaged changes). If there are unstaged changes, abort and ask the user to stage their changes first with `git add`.
+
+Also verify that we are on the `main` branch. If not, abort and ask the user to switch to main first.
+
+## Step 2: Ensure main is up to date
+
+Run: `git pull origin main`
+
+This ensures we're working from the latest code.
+
+## Step 3: Review staged changes
+
+Run: `git diff --cached`
+
+Review the staged changes to understand what the PR will contain.
+
+## Step 4: Generate commit message
+
+Based on the staged changes, generate a concise commit message (1-2 sentences) that describes the "why" rather than the "what".
+
+Display the proposed commit message to the user and ask for confirmation before proceeding.
+
+## Step 5: Create a new branch
+
+Get the git username: `git config user.name | tr ' ' '-' | tr '[:upper:]' '[:lower:]'`
+
+Create a short, descriptive branch name based on the changes (e.g., `fix-typo-in-readme`, `add-retry-logic`, `update-deps`).
+
+Create and checkout the branch: `git checkout -b {username}/{short-descriptive-name}`
+
+## Step 6: Commit changes
+
+Commit with the message from step 3:
+```
+git commit -m "{commit-message}"
+```
+
+## Step 7: Push and open a PR
+
+Push the branch and open a PR:
+```
+git push -u origin {branch-name}
+gh pr create --title "{commit-message-title}" --body "{longer-description-if-needed}"
+```
+
+Report the PR URL to the user when complete.
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,3 +1,56 @@
+Tantivy 0.26 (Unreleased)
+================================
+
+## Bugfixes
+- Align float query coercion during search with the columnar coercion rules [#2692](https://github.com/quickwit-oss/tantivy/pull/2692)(@fulmicoton)
+- Fix lenient elastic range queries with trailing closing parentheses [#2816](https://github.com/quickwit-oss/tantivy/pull/2816)(@evance-br)
+- Fix regex queries between parentheses [#2818](https://github.com/quickwit-oss/tantivy/pull/2818)(@Darkheir)
+- Fix boolean query incorrectly dropping documents when `AllScorer` is present [#2760](https://github.com/quickwit-oss/tantivy/pull/2760)(@mdashti)
+- Fix `minimum_should_match` bug with `AllScorer` [#2774](https://github.com/quickwit-oss/tantivy/pull/2774)(@trinity-1686a)
+- Fix intersection `seek()` advancing below current doc id [#2812](https://github.com/quickwit-oss/tantivy/pull/2812)(@fulmicoton)
+- Fix union performance regression caused by `partial_cmp` in hot path [#2790](https://github.com/quickwit-oss/tantivy/pull/2790)(@PSeitz)
+- Fix phrase query prefixed with `*` [#2751](https://github.com/quickwit-oss/tantivy/pull/2751)(@Darkheir)
+- Fix `vint` buffer overflow during index creation [#2778](https://github.com/quickwit-oss/tantivy/pull/2778)(@rebasedming)
+- Fix integer overflow in `ExpUnrolledLinkedList` for large datasets [#2735](https://github.com/quickwit-oss/tantivy/pull/2735)(@mdashti)
+- Fix `Column::first` parameter type [#2792](https://github.com/quickwit-oss/tantivy/pull/2792)(@ChangRui-Ryan)
+- Fix integer overflow in segment sorting and merge policy truncation [#2846](https://github.com/quickwit-oss/tantivy/pull/2846)(@anaslimem)
+- Fix strict comparison in `TopNComputer` for duplicate sort keys [#2777](https://github.com/quickwit-oss/tantivy/pull/2777)(@stuhood)
+- Fix merging of intermediate aggregation results [#2719](https://github.com/quickwit-oss/tantivy/pull/2719)(@PSeitz)
+
+## Features/Improvements
+- **Aggregation**
+    - Add filter aggregation [#2711](https://github.com/quickwit-oss/tantivy/pull/2711)(@mdashti)
+    - Add include/exclude filtering for term aggregations [#2717](https://github.com/quickwit-oss/tantivy/pull/2717)(@PSeitz)
+    - Add public accessors for intermediate aggregation results [#2829](https://github.com/quickwit-oss/tantivy/pull/2829)(@congx4)
+    - Replace HyperLogLog++ with Apache DataSketches HLL for cardinality aggregation [#2837](https://github.com/quickwit-oss/tantivy/pull/2837)(@congx4)
+    - Use Java-compatible DDSketch binary encoding for percentiles [#2842](https://github.com/quickwit-oss/tantivy/pull/2842)(@congx4)
+    - Export fields of `PercentileValuesVecEntry` [#2833](https://github.com/quickwit-oss/tantivy/pull/2833)(@mdumandag)
+- **Fast Fields**
+    - Add fast field fallback for `TermQuery` when the field is not indexed [#2693](https://github.com/quickwit-oss/tantivy/pull/2693)(@PSeitz-dd)
+    - Add fast field support for `Bytes` values [#2830](https://github.com/quickwit-oss/tantivy/pull/2830)(@mdashti)
+- Add regex query grammar [#2677](https://github.com/quickwit-oss/tantivy/pull/2677)(@Darkheir)
+- Add erased `SortKeyComputer` for sorting on column types unknown until runtime [#2770](https://github.com/quickwit-oss/tantivy/pull/2770)(@stuhood)
+- Add natural-order-with-none-highest support in `TopDocs::order_by` [#2780](https://github.com/quickwit-oss/tantivy/pull/2780)(@stuhood)
+- Move stemming behing `stemmer` feature flag [#2791](https://github.com/quickwit-oss/tantivy/pull/2791)(@fulmicoton)
+- Make posting list serializer public [#2835](https://github.com/quickwit-oss/tantivy/pull/2835)(@PSeitz)
+- Make `DeleteMeta`, `advance_deletes`, `with_max_doc`, and `delete_queue` public [#2765](https://github.com/quickwit-oss/tantivy/pull/2765) [#2766](https://github.com/quickwit-oss/tantivy/pull/2766)(@philippemnoel)
+- Make `Language` hashable [#2763](https://github.com/quickwit-oss/tantivy/pull/2763)(@philippemnoel)
+- Improve `space_usage` reporting for JSON fields and columnar data [#2761](https://github.com/quickwit-oss/tantivy/pull/2761)(@PSeitz-dd)
+
+## Performance
+- **Aggregation**
+    - Speed up aggregations by using one collector per request instead of one per bucket, and add `PagedTermMap` for faster term aggregations [#2759](https://github.com/quickwit-oss/tantivy/pull/2759)(@PSeitz-dd)
+    - Optimize low-cardinality term aggregations by using a `Vec` instead of a `HashMap` [#2740](https://github.com/quickwit-oss/tantivy/pull/2740)(@fulmicoton-dd)
+- Optimize `ExistsQuery` for a high number of dynamic columns [#2694](https://github.com/quickwit-oss/tantivy/pull/2694)(@PSeitz-dd)
+- Add lazy scorers to stop score evaluation early when a doc won't reach the top-K threshold [#2726](https://github.com/quickwit-oss/tantivy/pull/2726)(@fulmicoton)
+- Add `DocSet::cost()` and use it to order scorers in intersections [#2707](https://github.com/quickwit-oss/tantivy/pull/2707)(@PSeitz)
+- Add `collect_block` support for collector wrappers [#2727](https://github.com/quickwit-oss/tantivy/pull/2727)(@stuhood)
+- Optimize saturated posting lists by replacing them with `AllScorer` in boolean queries [#2745](https://github.com/quickwit-oss/tantivy/pull/2745)(@fulmicoton)
+- Add `seek_into_the_danger_zone` on `DocSet` for more efficient intersections [#2538](https://github.com/quickwit-oss/tantivy/pull/2538)(@PSeitz)
+- Skip column traversal in `RangeDocSet` when query range does not overlap with column bounds [#2783](https://github.com/quickwit-oss/tantivy/pull/2783)(@ChangRui-Ryan)
+- Speed up exclude queries by supporting multiple excluded `DocSet`s without intermediate union [#2825](https://github.com/quickwit-oss/tantivy/pull/2825)(@PSeitz)
+- Deduplicate queries in query parser [#2698](https://github.com/quickwit-oss/tantivy/pull/2698)(@PSeitz-dd)
+
 Tantivy 0.25
 ================================

--- a/Cargo.toml
+++ b/Cargo.toml
@@ -15,7 +15,7 @@ rust-version = "1.85"
 exclude = ["benches/*.json", "benches/*.txt"]

 [dependencies]
-oneshot = "0.1.7"
+oneshot = "0.1.13"
 base64 = "0.22.0"
 byteorder = "1.4.3"
 crc32fast = "1.3.2"
@@ -47,7 +47,7 @@ rustc-hash = "2.0.0"
 thiserror = "2.0.1"
 htmlescape = "0.3.1"
 fail = { version = "0.5.0", optional = true }
-time = { version = "0.3.35", features = ["serde-well-known"] }
+time = { version = "0.3.47", features = ["serde-well-known"] }
 smallvec = "1.8.0"
 rayon = "1.5.2"
 lru = "0.16.3"
@@ -64,8 +64,8 @@ query-grammar = { version = "0.25.0", path = "./query-grammar", package = "tanti
 tantivy-bitpacker = { version = "0.9", path = "./bitpacker" }
 common = { version = "0.10", path = "./common/", package = "tantivy-common" }
 tokenizer-api = { version = "0.6", path = "./tokenizer-api", package = "tantivy-tokenizer-api" }
-sketches-ddsketch = { version = "0.3.0", features = ["use_serde"] }
-hyperloglogplus = { version = "0.4.1", features = ["const-loop"] }
+sketches-ddsketch = { git = "https://github.com/quickwit-oss/rust-sketches-ddsketch.git", rev = "555caf1", features = ["use_serde"] }
+datasketches = "0.2.0"
 futures-util = { version = "0.3.28", optional = true }
 futures-channel = { version = "0.3.28", optional = true }
 fnv = "1.0.7"
@@ -86,7 +86,7 @@ futures = "0.3.21"
 paste = "1.0.11"
 more-asserts = "0.3.1"
 rand_distr = "0.5"
-time = { version = "0.3.10", features = ["serde-well-known", "macros"] }
+time = { version = "0.3.47", features = ["serde-well-known", "macros"] }
 postcard = { version = "1.0.4", features = [
    "use-std",
 ], default-features = false }
@@ -193,3 +193,11 @@ harness = false
 [[bench]]
 name = "str_search_and_get"
 harness = false
+
+[[bench]]
+name = "merge_segments"
+harness = false
+
+[[bench]]
+name = "regex_all_terms"
+harness = false
--- a/benches/merge_segments.rs
+++ b/benches/merge_segments.rs
@@ -0,0 +1,224 @@
+// Benchmarks segment merging
+//
+// Notes:
+// - Input segments are kept intact (no deletes / no IndexWriter merge).
+// - Output is written to a `NullDirectory` that discards all files except
+//  fieldnorms (needed for merging).
+
+use std::collections::HashMap;
+use std::io::{self, Write};
+use std::path::{Path, PathBuf};
+use std::sync::{Arc, RwLock};
+
+use binggan::{black_box, BenchRunner};
+use rand::prelude::*;
+use rand::rngs::StdRng;
+use rand::SeedableRng;
+use tantivy::directory::error::{DeleteError, OpenReadError, OpenWriteError};
+use tantivy::directory::{
+    AntiCallToken, Directory, FileHandle, OwnedBytes, TerminatingWrite, WatchCallback, WatchHandle,
+    WritePtr,
+};
+use tantivy::indexer::{merge_filtered_segments, NoMergePolicy};
+use tantivy::schema::{Schema, TEXT};
+use tantivy::{doc, HasLen, Index, IndexSettings, Segment};
+
+#[derive(Clone, Default, Debug)]
+struct NullDirectory {
+    blobs: Arc<RwLock<HashMap<PathBuf, OwnedBytes>>>,
+}
+
+struct NullWriter;
+
+impl Write for NullWriter {
+    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
+        Ok(buf.len())
+    }
+
+    fn flush(&mut self) -> io::Result<()> {
+        Ok(())
+    }
+}
+
+impl TerminatingWrite for NullWriter {
+    fn terminate_ref(&mut self, _token: AntiCallToken) -> io::Result<()> {
+        Ok(())
+    }
+}
+
+struct InMemoryWriter {
+    path: PathBuf,
+    buffer: Vec<u8>,
+    blobs: Arc<RwLock<HashMap<PathBuf, OwnedBytes>>>,
+}
+
+impl Write for InMemoryWriter {
+    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
+        self.buffer.extend_from_slice(buf);
+        Ok(buf.len())
+    }
+
+    fn flush(&mut self) -> io::Result<()> {
+        Ok(())
+    }
+}
+
+impl TerminatingWrite for InMemoryWriter {
+    fn terminate_ref(&mut self, _token: AntiCallToken) -> io::Result<()> {
+        let bytes = OwnedBytes::new(std::mem::take(&mut self.buffer));
+        self.blobs.write().unwrap().insert(self.path.clone(), bytes);
+        Ok(())
+    }
+}
+
+#[derive(Debug, Default)]
+struct NullFileHandle;
+impl HasLen for NullFileHandle {
+    fn len(&self) -> usize {
+        0
+    }
+}
+impl FileHandle for NullFileHandle {
+    fn read_bytes(&self, _range: std::ops::Range<usize>) -> io::Result<OwnedBytes> {
+        unimplemented!()
+    }
+}
+
+impl Directory for NullDirectory {
+    fn get_file_handle(&self, path: &Path) -> Result<Arc<dyn FileHandle>, OpenReadError> {
+        if let Some(bytes) = self.blobs.read().unwrap().get(path) {
+            return Ok(Arc::new(bytes.clone()));
+        }
+        Ok(Arc::new(NullFileHandle))
+    }
+
+    fn delete(&self, _path: &Path) -> Result<(), DeleteError> {
+        Ok(())
+    }
+
+    fn exists(&self, _path: &Path) -> Result<bool, OpenReadError> {
+        Ok(true)
+    }
+
+    fn open_write(&self, path: &Path) -> Result<WritePtr, OpenWriteError> {
+        let path_buf = path.to_path_buf();
+        if path.to_string_lossy().ends_with(".fieldnorm") {
+            let writer = InMemoryWriter {
+                path: path_buf,
+                buffer: Vec::new(),
+                blobs: Arc::clone(&self.blobs),
+            };
+            Ok(io::BufWriter::new(Box::new(writer)))
+        } else {
+            Ok(io::BufWriter::new(Box::new(NullWriter)))
+        }
+    }
+
+    fn atomic_read(&self, path: &Path) -> Result<Vec<u8>, OpenReadError> {
+        if let Some(bytes) = self.blobs.read().unwrap().get(path) {
+            return Ok(bytes.as_slice().to_vec());
+        }
+        Err(OpenReadError::FileDoesNotExist(path.to_path_buf()))
+    }
+
+    fn atomic_write(&self, _path: &Path, _data: &[u8]) -> io::Result<()> {
+        Ok(())
+    }
+
+    fn sync_directory(&self) -> io::Result<()> {
+        Ok(())
+    }
+
+    fn watch(&self, _watch_callback: WatchCallback) -> tantivy::Result<WatchHandle> {
+        Ok(WatchHandle::empty())
+    }
+}
+
+struct MergeScenario {
+    #[allow(dead_code)]
+    index: Index,
+    segments: Vec<Segment>,
+    settings: IndexSettings,
+    label: String,
+}
+
+fn build_index(
+    num_segments: usize,
+    docs_per_segment: usize,
+    tokens_per_doc: usize,
+    vocab_size: usize,
+) -> MergeScenario {
+    let mut schema_builder = Schema::builder();
+    let body = schema_builder.add_text_field("body", TEXT);
+    let schema = schema_builder.build();
+    let index = Index::create_in_ram(schema.clone());
+
+    assert!(vocab_size > 0);
+    let total_tokens = num_segments * docs_per_segment * tokens_per_doc;
+    let use_unique_terms = vocab_size >= total_tokens;
+    let mut rng = StdRng::from_seed([7u8; 32]);
+    let mut next_token_id: u64 = 0;
+
+    {
+        let mut writer = index.writer_with_num_threads(1, 256_000_000).unwrap();
+        writer.set_merge_policy(Box::new(NoMergePolicy));
+        for _ in 0..num_segments {
+            for _ in 0..docs_per_segment {
+                let mut tokens = Vec::with_capacity(tokens_per_doc);
+                for _ in 0..tokens_per_doc {
+                    let token_id = if use_unique_terms {
+                        let id = next_token_id;
+                        next_token_id += 1;
+                        id
+                    } else {
+                        rng.random_range(0..vocab_size as u64)
+                    };
+                    tokens.push(format!("term_{token_id}"));
+                }
+                writer.add_document(doc!(body => tokens.join(" "))).unwrap();
+            }
+            writer.commit().unwrap();
+        }
+    }
+
+    let segments = index.searchable_segments().unwrap();
+    let settings = index.settings().clone();
+    let label = format!(
+        "segments={}, docs/seg={}, tokens/doc={}, vocab={}",
+        num_segments, docs_per_segment, tokens_per_doc, vocab_size
+    );
+
+    MergeScenario {
+        index,
+        segments,
+        settings,
+        label,
+    }
+}
+
+fn main() {
+    let scenarios = vec![
+        build_index(8, 50_000, 12, 8),
+        build_index(16, 50_000, 12, 8),
+        build_index(16, 100_000, 12, 8),
+        build_index(8, 50_000, 8, 8 * 50_000 * 8),
+    ];
+
+    let mut runner = BenchRunner::new();
+    for scenario in scenarios {
+        let mut group = runner.new_group();
+        group.set_name(format!("merge_segments inv_index — {}", scenario.label));
+        let segments = scenario.segments.clone();
+        let settings = scenario.settings.clone();
+        group.register("merge", move |_| {
+            let output_dir = NullDirectory::default();
+            let filter_doc_ids = vec![None; segments.len()];
+            let merged_index =
+                merge_filtered_segments(&segments, settings.clone(), filter_doc_ids, output_dir)
+                    .unwrap();
+            black_box(merged_index);
+        });
+
+        group.run();
+    }
+}
--- a/benches/regex_all_terms.rs
+++ b/benches/regex_all_terms.rs
@@ -0,0 +1,113 @@
+// Benchmarks regex query that matches all terms in a synthetic index.
+//
+// Corpus model:
+// - N unique terms: t000000, t000001, ...
+// - M docs
+// - K tokens per doc: doc i gets terms derived from (i, token_index)
+//
+// Query:
+// - Regex "t.*" to match all terms
+//
+// Run with:
+// - cargo bench --bench regex_all_terms
+//
+
+use std::fmt::Write;
+
+use binggan::{black_box, BenchRunner};
+use tantivy::collector::Count;
+use tantivy::query::RegexQuery;
+use tantivy::schema::{Schema, TEXT};
+use tantivy::{doc, Index, ReloadPolicy};
+
+const HEAP_SIZE_BYTES: usize = 200_000_000;
+
+#[derive(Clone, Copy)]
+struct BenchConfig {
+    num_terms: usize,
+    num_docs: usize,
+    tokens_per_doc: usize,
+}
+
+fn main() {
+    let configs = default_configs();
+
+    let mut runner = BenchRunner::new();
+    for config in configs {
+        let (index, text_field) = build_index(config, HEAP_SIZE_BYTES);
+        let reader = index
+            .reader_builder()
+            .reload_policy(ReloadPolicy::Manual)
+            .try_into()
+            .expect("reader");
+        let searcher = reader.searcher();
+        let query = RegexQuery::from_pattern("t.*", text_field).expect("regex query");
+
+        let mut group = runner.new_group();
+        group.set_name(format!(
+            "regex_all_terms_t{}_d{}_k{}",
+            config.num_terms, config.num_docs, config.tokens_per_doc
+        ));
+        group.register("regex_count", move |_| {
+            let count = searcher.search(&query, &Count).expect("search");
+            black_box(count);
+        });
+        group.run();
+    }
+}
+
+fn default_configs() -> Vec<BenchConfig> {
+    vec![
+        BenchConfig {
+            num_terms: 10_000,
+            num_docs: 100_000,
+            tokens_per_doc: 1,
+        },
+        BenchConfig {
+            num_terms: 10_000,
+            num_docs: 100_000,
+            tokens_per_doc: 8,
+        },
+        BenchConfig {
+            num_terms: 100_000,
+            num_docs: 100_000,
+            tokens_per_doc: 1,
+        },
+        BenchConfig {
+            num_terms: 100_000,
+            num_docs: 100_000,
+            tokens_per_doc: 8,
+        },
+    ]
+}
+
+fn build_index(config: BenchConfig, heap_size_bytes: usize) -> (Index, tantivy::schema::Field) {
+    let mut schema_builder = Schema::builder();
+    let text_field = schema_builder.add_text_field("text", TEXT);
+    let schema = schema_builder.build();
+    let index = Index::create_in_ram(schema);
+
+    let term_width = config.num_terms.to_string().len();
+    {
+        let mut writer = index
+            .writer_with_num_threads(1, heap_size_bytes)
+            .expect("writer");
+        let mut buffer = String::new();
+        for doc_id in 0..config.num_docs {
+            buffer.clear();
+            for token_idx in 0..config.tokens_per_doc {
+                if token_idx > 0 {
+                    buffer.push(' ');
+                }
+                let term_id = (doc_id * config.tokens_per_doc + token_idx) % config.num_terms;
+                write!(&mut buffer, "t{term_id:0term_width$}").expect("write token");
+            }
+            writer
+                .add_document(doc!(text_field => buffer.as_str()))
+                .expect("add_document");
+        }
+        writer.commit().expect("commit");
+    }
+
+    (index, text_field)
+}
--- a/common/Cargo.toml
+++ b/common/Cargo.toml
@@ -15,11 +15,10 @@ repository = "https://github.com/quickwit-oss/tantivy"
 byteorder = "1.4.3"
 ownedbytes = { version= "0.9", path="../ownedbytes" }
 async-trait = "0.1"
-time = { version = "0.3.10", features = ["serde-well-known"] }
+time = { version = "0.3.47", features = ["serde-well-known"] }
 serde = { version = "1.0.136", features = ["derive"] }

 [dev-dependencies]
 binggan = "0.14.0"
 proptest = "1.0.0"
 rand = "0.9"
-
--- a/common/src/writer.rs
+++ b/common/src/writer.rs
@@ -62,7 +62,9 @@ impl<W: TerminatingWrite> TerminatingWrite for CountingWriter<W> {
 pub struct AntiCallToken(());

 /// Trait used to indicate when no more write need to be done on a writer
-pub trait TerminatingWrite: Write + Send + Sync {
+///
+/// Thread-safety is enforced at the call sites that require it.
+pub trait TerminatingWrite: Write {
    /// Indicate that the writer will no longer be used. Internally call terminate_ref.
    fn terminate(mut self) -> io::Result<()>
    where Self: Sized {
--- a/doc/src/json.md
+++ b/doc/src/json.md
@@ -60,7 +60,7 @@ At indexing, tantivy will try to interpret number and strings as different type
 priority order.

 Numbers will be interpreted as u64, i64 and f64 in that order.
-Strings will be interpreted as rfc3999 dates or simple strings.
+Strings will be interpreted as rfc3339 dates or simple strings.

 The first working type is picked and is the only term that is emitted for indexing.
 Note this interpretation happens on a per-document basis, and there is no effort to try to sniff
@@ -81,7 +81,7 @@ Will be interpreted as
 (my_path.my_segment, String, 233) or (my_path.my_segment, u64, 233)
 ```

-Likewise, we need to emit two tokens if the query contains an rfc3999 date.
+Likewise, we need to emit two tokens if the query contains an rfc3339 date.
 Indeed the date could have been actually a single token inside the text of a document at ingestion time. Generally speaking, we will always at least emit a string token in query parsing, and sometimes more.

 If one more json field is defined, things get even more complicated.
--- a/query-grammar/src/query_grammar.rs
+++ b/query-grammar/src/query_grammar.rs
@@ -560,7 +560,7 @@ fn range_infallible(inp: &str) -> JResult<&str, UserInputLeaf> {
            (
                (
                    value((), tag(">=")),
-                    map(word_infallible("", false), |(bound, err)| {
+                    map(word_infallible(")", false), |(bound, err)| {
                        (
                            (
                                bound
@@ -574,7 +574,7 @@ fn range_infallible(inp: &str) -> JResult<&str, UserInputLeaf> {
                ),
                (
                    value((), tag("<=")),
-                    map(word_infallible("", false), |(bound, err)| {
+                    map(word_infallible(")", false), |(bound, err)| {
                        (
                            (
                                UserInputBound::Unbounded,
@@ -588,7 +588,7 @@ fn range_infallible(inp: &str) -> JResult<&str, UserInputLeaf> {
                ),
                (
                    value((), tag(">")),
-                    map(word_infallible("", false), |(bound, err)| {
+                    map(word_infallible(")", false), |(bound, err)| {
                        (
                            (
                                bound
@@ -602,7 +602,7 @@ fn range_infallible(inp: &str) -> JResult<&str, UserInputLeaf> {
                ),
                (
                    value((), tag("<")),
-                    map(word_infallible("", false), |(bound, err)| {
+                    map(word_infallible(")", false), |(bound, err)| {
                        (
                            (
                                UserInputBound::Unbounded,
@@ -704,7 +704,11 @@ fn regex(inp: &str) -> IResult<&str, UserInputLeaf> {
                many1(alt((preceded(char('\\'), char('/')), none_of("/")))),
                char('/'),
            ),
-            peek(alt((multispace1, eof))),
+            peek(alt((
+                value((), multispace1),
+                value((), char(')')),
+                value((), eof),
+            ))),
        ),
        |elements| UserInputLeaf::Regex {
            field: None,
@@ -721,8 +725,12 @@ fn regex_infallible(inp: &str) -> JResult<&str, UserInputLeaf> {
            opt_i_err(char('/'), "missing delimiter /"),
        ),
        opt_i_err(
-            peek(alt((multispace1, eof))),
-            "expected whitespace or end of input",
+            peek(alt((
+                value((), multispace1),
+                value((), char(')')),
+                value((), eof),
+            ))),
+            "expected whitespace, closing parenthesis, or end of input",
        ),
    )(inp)
    {
@@ -1323,6 +1331,14 @@ mod test {
        test_parse_query_to_ast_helper("<a", "{\"*\" TO \"a\"}");
        test_parse_query_to_ast_helper("<=a", "{\"*\" TO \"a\"]");
        test_parse_query_to_ast_helper("<=bsd", "{\"*\" TO \"bsd\"]");
+
+        test_parse_query_to_ast_helper("(<=42)", "{\"*\" TO \"42\"]");
+        test_parse_query_to_ast_helper("(<=42 )", "{\"*\" TO \"42\"]");
+        test_parse_query_to_ast_helper("(age:>5)", "\"age\":{\"5\" TO \"*\"}");
+        test_parse_query_to_ast_helper(
+            "(title:bar AND age:>12)",
+            "(+\"title\":bar +\"age\":{\"12\" TO \"*\"})",
+        );
    }

    #[test]
@@ -1699,6 +1715,10 @@ mod test {
        test_parse_query_to_ast_helper("foo:(A OR B)", "(?\"foo\":A ?\"foo\":B)");
        test_parse_query_to_ast_helper("foo:(A* OR B*)", "(?\"foo\":A* ?\"foo\":B*)");
        test_parse_query_to_ast_helper("foo:(*A OR *B)", "(?\"foo\":*A ?\"foo\":*B)");
+
+        // Regexes between parentheses
+        test_parse_query_to_ast_helper("foo:(/A.*/)", "\"foo\":/A.*/");
+        test_parse_query_to_ast_helper("foo:(/A.*/ OR /B.*/)", "(?\"foo\":/A.*/ ?\"foo\":/B.*/)");
    }

    #[test]
--- a/query-grammar/src/user_input_ast.rs
+++ b/query-grammar/src/user_input_ast.rs
@@ -66,6 +66,7 @@ impl UserInputLeaf {
            }
            UserInputLeaf::Range { field, .. } if field.is_none() => *field = Some(default_field),
            UserInputLeaf::Set { field, .. } if field.is_none() => *field = Some(default_field),
+            UserInputLeaf::Regex { field, .. } if field.is_none() => *field = Some(default_field),
            _ => (), // field was already set, do nothing
        }
    }
--- a/src/aggregation/intermediate_agg_result.rs
+++ b/src/aggregation/intermediate_agg_result.rs
@@ -90,6 +90,19 @@ impl From<IntermediateKey> for Key {

 impl Eq for IntermediateKey {}

+impl std::fmt::Display for IntermediateKey {
+    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+        match self {
+            IntermediateKey::Str(val) => f.write_str(val),
+            IntermediateKey::F64(val) => f.write_str(&val.to_string()),
+            IntermediateKey::U64(val) => f.write_str(&val.to_string()),
+            IntermediateKey::I64(val) => f.write_str(&val.to_string()),
+            IntermediateKey::Bool(val) => f.write_str(&val.to_string()),
+            IntermediateKey::IpAddr(val) => f.write_str(&val.to_string()),
+        }
+    }
+}
+
 impl std::hash::Hash for IntermediateKey {
    fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
        core::mem::discriminant(self).hash(state);
@@ -105,6 +118,21 @@ impl std::hash::Hash for IntermediateKey {
 }

 impl IntermediateAggregationResults {
+    /// Returns a reference to the intermediate aggregation result for the given key.
+    pub fn get(&self, key: &str) -> Option<&IntermediateAggregationResult> {
+        self.aggs_res.get(key)
+    }
+
+    /// Removes and returns the intermediate aggregation result for the given key.
+    pub fn remove(&mut self, key: &str) -> Option<IntermediateAggregationResult> {
+        self.aggs_res.remove(key)
+    }
+
+    /// Returns an iterator over the keys in the intermediate aggregation results.
+    pub fn keys(&self) -> impl Iterator<Item = &String> {
+        self.aggs_res.keys()
+    }
+
    /// Add a result
    pub fn push(&mut self, key: String, value: IntermediateAggregationResult) -> crate::Result<()> {
        let entry = self.aggs_res.entry(key);
@@ -639,6 +667,21 @@ pub struct IntermediateTermBucketResult {
 }

 impl IntermediateTermBucketResult {
+    /// Returns a reference to the map of bucket entries keyed by [`IntermediateKey`].
+    pub fn entries(&self) -> &FxHashMap<IntermediateKey, IntermediateTermBucketEntry> {
+        &self.entries
+    }
+
+    /// Returns the count of documents not included in the returned buckets.
+    pub fn sum_other_doc_count(&self) -> u64 {
+        self.sum_other_doc_count
+    }
+
+    /// Returns the upper bound of the error on document counts in the returned buckets.
+    pub fn doc_count_error_upper_bound(&self) -> u64 {
+        self.doc_count_error_upper_bound
+    }
+
    pub(crate) fn into_final_result(
        self,
        req: &TermsAggregation,
@@ -820,7 +863,7 @@ impl IntermediateRangeBucketEntry {
        };

        // If we have a date type on the histogram buckets, we add the `key_as_string` field as
-        // rfc339
+        // rfc3339
        if column_type == Some(ColumnType::DateTime) {
            if let Some(val) = range_bucket_entry.to {
                let key_as_string = format_date(val as i64)?;
--- a/src/aggregation/metric/average.rs
+++ b/src/aggregation/metric/average.rs
@@ -55,6 +55,12 @@ impl IntermediateAverage {
    pub(crate) fn from_stats(stats: IntermediateStats) -> Self {
        Self { stats }
    }
+
+    /// Returns a reference to the underlying [`IntermediateStats`].
+    pub fn stats(&self) -> &IntermediateStats {
+        &self.stats
+    }
+
    /// Merges the other intermediate result into self.
    pub fn merge_fruits(&mut self, other: IntermediateAverage) {
        self.stats.merge_fruits(other.stats);
--- a/src/aggregation/metric/cardinality.rs
+++ b/src/aggregation/metric/cardinality.rs
@@ -1,12 +1,11 @@
-use std::collections::hash_map::DefaultHasher;
-use std::hash::{BuildHasher, Hasher};
+use std::hash::Hash;

 use columnar::column_values::CompactSpaceU64Accessor;
 use columnar::{Column, ColumnType, Dictionary, StrColumn};
 use common::f64_to_u64;
-use hyperloglogplus::{HyperLogLog, HyperLogLogPlus};
+use datasketches::hll::{HllSketch, HllType, HllUnion};
 use rustc_hash::FxHashSet;
-use serde::{Deserialize, Serialize};
+use serde::{Deserialize, Deserializer, Serialize, Serializer};

 use crate::aggregation::agg_data::AggregationsSegmentCtx;
 use crate::aggregation::intermediate_agg_result::{
@@ -16,29 +15,17 @@ use crate::aggregation::segment_agg_result::SegmentAggregationCollector;
 use crate::aggregation::*;
 use crate::TantivyError;

-#[derive(Clone, Debug, Serialize, Deserialize)]
-struct BuildSaltedHasher {
-    salt: u8,
-}
-
-impl BuildHasher for BuildSaltedHasher {
-    type Hasher = DefaultHasher;
-
-    fn build_hasher(&self) -> Self::Hasher {
-        let mut hasher = DefaultHasher::new();
-        hasher.write_u8(self.salt);
-
-        hasher
-    }
-}
+/// Log2 of the number of registers for the HLL sketch.
+/// 2^11 = 2048 registers, giving ~2.3% relative error and ~1KB per sketch (Hll4).
+const LG_K: u8 = 11;

 /// # Cardinality
 ///
 /// The cardinality aggregation allows for computing an estimate
 /// of the number of different values in a data set based on the
-/// HyperLogLog++ algorithm. This is particularly useful for understanding the
-/// uniqueness of values in a large dataset where counting each unique value
-/// individually would be computationally expensive.
+/// Apache DataSketches HyperLogLog algorithm. This is particularly useful for
+/// understanding the uniqueness of values in a large dataset where counting
+/// each unique value individually would be computationally expensive.
 ///
 /// For example, you might use a cardinality aggregation to estimate the number
 /// of unique visitors to a website by aggregating on a field that contains
@@ -184,7 +171,7 @@ impl SegmentCardinalityCollectorBucket {

            term_ids.sort_unstable();
            dict.sorted_ords_to_term_cb(term_ids.iter().map(|term| *term as u64), |term| {
-                self.cardinality.sketch.insert_any(&term);
+                self.cardinality.insert(term);
                Ok(())
            })?;
            if has_missing {
@@ -195,17 +182,17 @@ impl SegmentCardinalityCollectorBucket {
                    );
                match missing_key {
                    Key::Str(missing) => {
-                        self.cardinality.sketch.insert_any(&missing);
+                        self.cardinality.insert(missing.as_str());
                    }
                    Key::F64(val) => {
                        let val = f64_to_u64(*val);
-                        self.cardinality.sketch.insert_any(&val);
+                        self.cardinality.insert(val);
                    }
                    Key::U64(val) => {
-                        self.cardinality.sketch.insert_any(&val);
+                        self.cardinality.insert(*val);
                    }
                    Key::I64(val) => {
-                        self.cardinality.sketch.insert_any(&val);
+                        self.cardinality.insert(*val);
                    }
                }
            }
@@ -296,11 +283,11 @@ impl SegmentAggregationCollector for SegmentCardinalityCollector {
                })?;
            for val in col_block_accessor.iter_vals() {
                let val: u128 = compact_space_accessor.compact_to_u128(val as u32);
-                bucket.cardinality.sketch.insert_any(&val);
+                bucket.cardinality.insert(val);
            }
        } else {
            for val in col_block_accessor.iter_vals() {
-                bucket.cardinality.sketch.insert_any(&val);
+                bucket.cardinality.insert(val);
            }
        }

@@ -321,11 +308,18 @@ impl SegmentAggregationCollector for SegmentCardinalityCollector {
    }
 }

-#[derive(Clone, Debug, Serialize, Deserialize)]
-/// The percentiles collector used during segment collection and for merging results.
+#[derive(Clone, Debug)]
+/// The cardinality collector used during segment collection and for merging results.
+/// Uses Apache DataSketches HLL (lg_k=11, Hll4) for compact binary serialization
+/// and cross-language compatibility (e.g. Java `datasketches` library).
 pub struct CardinalityCollector {
-    sketch: HyperLogLogPlus<u64, BuildSaltedHasher>,
+    sketch: HllSketch,
+    /// Salt derived from `ColumnType`, used to differentiate values of different column types
+    /// that map to the same u64 (e.g. bool `false` = 0 vs i64 `0`).
+    /// Not serialized — only needed during insertion, not after sketch registers are populated.
+    salt: u8,
 }
+
 impl Default for CardinalityCollector {
    fn default() -> Self {
        Self::new(0)
@@ -338,25 +332,52 @@ impl PartialEq for CardinalityCollector {
    }
 }

-impl CardinalityCollector {
-    /// Compute the final cardinality estimate.
-    pub fn finalize(self) -> Option<f64> {
-        Some(self.sketch.clone().count().trunc())
+impl Serialize for CardinalityCollector {
+    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
+        let bytes = self.sketch.serialize();
+        serializer.serialize_bytes(&bytes)
    }
+}

+impl<'de> Deserialize<'de> for CardinalityCollector {
+    fn deserialize<D: Deserializer<'de>>(deserializer: D) -> Result<Self, D::Error> {
+        let bytes: Vec<u8> = Deserialize::deserialize(deserializer)?;
+        let sketch = HllSketch::deserialize(&bytes).map_err(serde::de::Error::custom)?;
+        Ok(Self { sketch, salt: 0 })
+    }
+}
+
+impl CardinalityCollector {
    fn new(salt: u8) -> Self {
        Self {
-            sketch: HyperLogLogPlus::new(16, BuildSaltedHasher { salt }).unwrap(),
+            sketch: HllSketch::new(LG_K, HllType::Hll4),
+            salt,
        }
    }

-    pub(crate) fn merge_fruits(&mut self, right: CardinalityCollector) -> crate::Result<()> {
-        self.sketch.merge(&right.sketch).map_err(|err| {
-            TantivyError::AggregationError(AggregationError::InternalError(format!(
-                "Error while merging cardinality {err:?}"
-            )))
-        })?;
+    /// Insert a value into the HLL sketch, salted by the column type.
+    /// The salt ensures that identical u64 values from different column types
+    /// (e.g. bool `false` vs i64 `0`) are counted as distinct.
+    pub(crate) fn insert<T: Hash>(&mut self, value: T) {
+        self.sketch.update((self.salt, value));
+    }

+    /// Compute the final cardinality estimate.
+    pub fn finalize(self) -> Option<f64> {
+        Some(self.sketch.estimate().trunc())
+    }
+
+    /// Serialize the HLL sketch to its compact binary representation.
+    /// The format is cross-language compatible with Apache DataSketches (Java, C++, Python).
+    pub fn to_sketch_bytes(&self) -> Vec<u8> {
+        self.sketch.serialize()
+    }
+
+    pub(crate) fn merge_fruits(&mut self, right: CardinalityCollector) -> crate::Result<()> {
+        let mut union = HllUnion::new(LG_K);
+        union.update(&self.sketch);
+        union.update(&right.sketch);
+        self.sketch = union.get_result(HllType::Hll4);
        Ok(())
    }
 }
@@ -518,4 +539,75 @@ mod tests {

        Ok(())
    }
+
+    #[test]
+    fn cardinality_collector_serde_roundtrip() {
+        use super::CardinalityCollector;
+
+        let mut collector = CardinalityCollector::default();
+        collector.insert("hello");
+        collector.insert("world");
+        collector.insert("hello"); // duplicate
+
+        let serialized = serde_json::to_vec(&collector).unwrap();
+        let deserialized: CardinalityCollector = serde_json::from_slice(&serialized).unwrap();
+
+        let original_estimate = collector.finalize().unwrap();
+        let roundtrip_estimate = deserialized.finalize().unwrap();
+        assert_eq!(original_estimate, roundtrip_estimate);
+        assert_eq!(original_estimate, 2.0);
+    }
+
+    #[test]
+    fn cardinality_collector_merge() {
+        use super::CardinalityCollector;
+
+        let mut left = CardinalityCollector::default();
+        left.insert("a");
+        left.insert("b");
+
+        let mut right = CardinalityCollector::default();
+        right.insert("b");
+        right.insert("c");
+
+        left.merge_fruits(right).unwrap();
+        let estimate = left.finalize().unwrap();
+        assert_eq!(estimate, 3.0);
+    }
+
+    #[test]
+    fn cardinality_collector_serialize_deserialize_binary() {
+        use datasketches::hll::HllSketch;
+
+        use super::CardinalityCollector;
+
+        let mut collector = CardinalityCollector::default();
+        collector.insert("apple");
+        collector.insert("banana");
+        collector.insert("cherry");
+
+        let bytes = collector.to_sketch_bytes();
+        let deserialized = HllSketch::deserialize(&bytes).unwrap();
+        assert!((deserialized.estimate() - 3.0).abs() < 0.01);
+    }
+
+    #[test]
+    fn cardinality_collector_salt_differentiates_types() {
+        use super::CardinalityCollector;
+
+        // Without salt, same u64 value from different column types would collide
+        let mut collector_bool = CardinalityCollector::new(5); // e.g. ColumnType::Bool
+        collector_bool.insert(0u64); // false
+        collector_bool.insert(1u64); // true
+
+        let mut collector_i64 = CardinalityCollector::new(2); // e.g. ColumnType::I64
+        collector_i64.insert(0u64);
+        collector_i64.insert(1u64);
+
+        // Merge them
+        collector_bool.merge_fruits(collector_i64).unwrap();
+        let estimate = collector_bool.finalize().unwrap();
+        // Should be 4 because salt makes (5, 0) != (2, 0) and (5, 1) != (2, 1)
+        assert_eq!(estimate, 4.0);
+    }
 }
--- a/src/aggregation/metric/mod.rs
+++ b/src/aggregation/metric/mod.rs
@@ -107,8 +107,11 @@ pub enum PercentileValues {
 #[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
 /// The entry when requesting percentiles with keyed: false
 pub struct PercentileValuesVecEntry {
-    key: f64,
-    value: f64,
+    /// Percentile
+    pub key: f64,
+
+    /// Value at the percentile
+    pub value: f64,
 }

 /// Single-metric aggregations use this common result structure.
--- a/src/aggregation/metric/percentiles.rs
+++ b/src/aggregation/metric/percentiles.rs
@@ -222,6 +222,12 @@ impl PercentilesCollector {
        self.sketch.add(val);
    }

+    /// Encode the underlying DDSketch to Java-compatible binary format
+    /// for cross-language serialization with Java consumers.
+    pub fn to_sketch_bytes(&self) -> Vec<u8> {
+        self.sketch.to_java_bytes()
+    }
+
    pub(crate) fn merge_fruits(&mut self, right: PercentilesCollector) -> crate::Result<()> {
        self.sketch.merge(&right.sketch).map_err(|err| {
            TantivyError::AggregationError(AggregationError::InternalError(format!(
@@ -325,7 +331,7 @@ mod tests {
    use crate::aggregation::AggregationCollector;
    use crate::query::AllQuery;
    use crate::schema::{Schema, FAST};
-    use crate::Index;
+    use crate::{assert_nearly_equals, Index};

    #[test]
    fn test_aggregation_percentiles_empty_index() -> crate::Result<()> {
@@ -608,12 +614,16 @@ mod tests {
        let res = exec_request_with_query(agg_req, &index, None)?;
        assert_eq!(res["range_with_stats"]["buckets"][0]["doc_count"], 3);

-        assert_eq!(
-            res["range_with_stats"]["buckets"][0]["percentiles"]["values"]["1.0"],
+        assert_nearly_equals!(
+            res["range_with_stats"]["buckets"][0]["percentiles"]["values"]["1.0"]
+                .as_f64()
+                .unwrap(),
            5.0028295751107414
        );
-        assert_eq!(
-            res["range_with_stats"]["buckets"][0]["percentiles"]["values"]["99.0"],
+        assert_nearly_equals!(
+            res["range_with_stats"]["buckets"][0]["percentiles"]["values"]["99.0"]
+                .as_f64()
+                .unwrap(),
            10.07469668951144
        );

@@ -659,8 +669,14 @@ mod tests {

        let res = exec_request_with_query(agg_req, &index, None)?;

-        assert_eq!(res["percentiles"]["values"]["1.0"], 5.0028295751107414);
-        assert_eq!(res["percentiles"]["values"]["99.0"], 10.07469668951144);
+        assert_nearly_equals!(
+            res["percentiles"]["values"]["1.0"].as_f64().unwrap(),
+            5.0028295751107414
+        );
+        assert_nearly_equals!(
+            res["percentiles"]["values"]["99.0"].as_f64().unwrap(),
+            10.07469668951144
+        );

        Ok(())
    }
--- a/src/aggregation/metric/stats.rs
+++ b/src/aggregation/metric/stats.rs
@@ -110,6 +110,16 @@ impl Default for IntermediateStats {
 }

 impl IntermediateStats {
+    /// Returns the number of values collected.
+    pub fn count(&self) -> u64 {
+        self.count
+    }
+
+    /// Returns the sum of all values collected.
+    pub fn sum(&self) -> f64 {
+        self.sum
+    }
+
    /// Merges the other stats intermediate result into self.
    pub fn merge_fruits(&mut self, other: IntermediateStats) {
        self.count += other.count;
--- a/src/collector/sort_key/mod.rs
+++ b/src/collector/sort_key/mod.rs
@@ -1,4 +1,5 @@
 mod order;
+mod sort_by_bytes;
 mod sort_by_erased_type;
 mod sort_by_score;
 mod sort_by_static_fast_value;
@@ -6,6 +7,7 @@ mod sort_by_string;
 mod sort_key_computer;

 pub use order::*;
+pub use sort_by_bytes::SortByBytes;
 pub use sort_by_erased_type::SortByErasedType;
 pub use sort_by_score::SortBySimilarityScore;
 pub use sort_by_static_fast_value::SortByStaticFastValue;
--- a/src/collector/sort_key/sort_by_bytes.rs
+++ b/src/collector/sort_key/sort_by_bytes.rs
@@ -0,0 +1,168 @@
+use columnar::BytesColumn;
+
+use crate::collector::sort_key::NaturalComparator;
+use crate::collector::{SegmentSortKeyComputer, SortKeyComputer};
+use crate::termdict::TermOrdinal;
+use crate::{DocId, Score};
+
+/// Sort by the first value of a bytes column.
+///
+/// If the field is multivalued, only the first value is considered.
+///
+/// Documents that do not have this value are still considered.
+/// Their sort key will simply be `None`.
+#[derive(Debug, Clone)]
+pub struct SortByBytes {
+    column_name: String,
+}
+
+impl SortByBytes {
+    /// Creates a new sort by bytes sort key computer.
+    pub fn for_field(column_name: impl ToString) -> Self {
+        SortByBytes {
+            column_name: column_name.to_string(),
+        }
+    }
+}
+
+impl SortKeyComputer for SortByBytes {
+    type SortKey = Option<Vec<u8>>;
+    type Child = ByBytesColumnSegmentSortKeyComputer;
+    type Comparator = NaturalComparator;
+
+    fn segment_sort_key_computer(
+        &self,
+        segment_reader: &crate::SegmentReader,
+    ) -> crate::Result<Self::Child> {
+        let bytes_column_opt = segment_reader.fast_fields().bytes(&self.column_name)?;
+        Ok(ByBytesColumnSegmentSortKeyComputer { bytes_column_opt })
+    }
+}
+
+/// Segment-level sort key computer for bytes columns.
+pub struct ByBytesColumnSegmentSortKeyComputer {
+    bytes_column_opt: Option<BytesColumn>,
+}
+
+impl SegmentSortKeyComputer for ByBytesColumnSegmentSortKeyComputer {
+    type SortKey = Option<Vec<u8>>;
+    type SegmentSortKey = Option<TermOrdinal>;
+    type SegmentComparator = NaturalComparator;
+
+    #[inline(always)]
+    fn segment_sort_key(&mut self, doc: DocId, _score: Score) -> Option<TermOrdinal> {
+        let bytes_column = self.bytes_column_opt.as_ref()?;
+        bytes_column.ords().first(doc)
+    }
+
+    fn convert_segment_sort_key(&self, term_ord_opt: Option<TermOrdinal>) -> Option<Vec<u8>> {
+        // TODO: Individual lookups to the dictionary like this are very likely to repeatedly
+        // decompress the same blocks. See https://github.com/quickwit-oss/tantivy/issues/2776
+        let term_ord = term_ord_opt?;
+        let bytes_column = self.bytes_column_opt.as_ref()?;
+        let mut bytes = Vec::new();
+        bytes_column
+            .dictionary()
+            .ord_to_term(term_ord, &mut bytes)
+            .ok()?;
+        Some(bytes)
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use super::SortByBytes;
+    use crate::collector::TopDocs;
+    use crate::query::AllQuery;
+    use crate::schema::{BytesOptions, Schema, FAST, INDEXED};
+    use crate::{Index, IndexWriter, Order, TantivyDocument};
+
+    #[test]
+    fn test_sort_by_bytes_asc() -> crate::Result<()> {
+        let mut schema_builder = Schema::builder();
+        let bytes_field = schema_builder
+            .add_bytes_field("data", BytesOptions::default().set_fast().set_indexed());
+        let id_field = schema_builder.add_u64_field("id", FAST | INDEXED);
+        let schema = schema_builder.build();
+        let index = Index::create_in_ram(schema);
+        let mut index_writer: IndexWriter = index.writer_for_tests()?;
+
+        // Insert documents with byte values in non-sorted order
+        let test_data: Vec<(u64, Vec<u8>)> = vec![
+            (1, vec![0x02, 0x00]),
+            (2, vec![0x00, 0x10]),
+            (3, vec![0x01, 0x00]),
+            (4, vec![0x00, 0x20]),
+        ];
+
+        for (id, bytes) in &test_data {
+            let mut doc = TantivyDocument::new();
+            doc.add_u64(id_field, *id);
+            doc.add_bytes(bytes_field, bytes);
+            index_writer.add_document(doc)?;
+        }
+        index_writer.commit()?;
+
+        let reader = index.reader()?;
+        let searcher = reader.searcher();
+
+        // Sort ascending by bytes
+        let top_docs =
+            TopDocs::with_limit(10).order_by((SortByBytes::for_field("data"), Order::Asc));
+        let results: Vec<(Option<Vec<u8>>, _)> = searcher.search(&AllQuery, &top_docs)?;
+
+        // Expected order: [0x00,0x10], [0x00,0x20], [0x01,0x00], [0x02,0x00]
+        let sorted_bytes: Vec<Option<Vec<u8>>> = results.into_iter().map(|(b, _)| b).collect();
+        assert_eq!(
+            sorted_bytes,
+            vec![
+                Some(vec![0x00, 0x10]),
+                Some(vec![0x00, 0x20]),
+                Some(vec![0x01, 0x00]),
+                Some(vec![0x02, 0x00]),
+            ]
+        );
+
+        Ok(())
+    }
+
+    #[test]
+    fn test_sort_by_bytes_desc() -> crate::Result<()> {
+        let mut schema_builder = Schema::builder();
+        let bytes_field = schema_builder
+            .add_bytes_field("data", BytesOptions::default().set_fast().set_indexed());
+        let schema = schema_builder.build();
+        let index = Index::create_in_ram(schema);
+        let mut index_writer: IndexWriter = index.writer_for_tests()?;
+
+        let test_data: Vec<Vec<u8>> = vec![vec![0x00, 0x10], vec![0x02, 0x00], vec![0x01, 0x00]];
+
+        for bytes in &test_data {
+            let mut doc = TantivyDocument::new();
+            doc.add_bytes(bytes_field, bytes);
+            index_writer.add_document(doc)?;
+        }
+        index_writer.commit()?;
+
+        let reader = index.reader()?;
+        let searcher = reader.searcher();
+
+        // Sort descending by bytes
+        let top_docs =
+            TopDocs::with_limit(10).order_by((SortByBytes::for_field("data"), Order::Desc));
+        let results: Vec<(Option<Vec<u8>>, _)> = searcher.search(&AllQuery, &top_docs)?;
+
+        // Expected order (descending): [0x02,0x00], [0x01,0x00], [0x00,0x10]
+        let sorted_bytes: Vec<Option<Vec<u8>>> = results.into_iter().map(|(b, _)| b).collect();
+        assert_eq!(
+            sorted_bytes,
+            vec![
+                Some(vec![0x02, 0x00]),
+                Some(vec![0x01, 0x00]),
+                Some(vec![0x00, 0x10]),
+            ]
+        );
+
+        Ok(())
+    }
+}
--- a/src/collector/sort_key/sort_by_erased_type.rs
+++ b/src/collector/sort_key/sort_by_erased_type.rs
@@ -1,7 +1,7 @@
 use columnar::{ColumnType, MonotonicallyMappableToU64};

 use crate::collector::sort_key::{
-    NaturalComparator, SortBySimilarityScore, SortByStaticFastValue, SortByString,
+    NaturalComparator, SortByBytes, SortBySimilarityScore, SortByStaticFastValue, SortByString,
 };
 use crate::collector::{SegmentSortKeyComputer, SortKeyComputer};
 use crate::fastfield::FastFieldNotAvailableError;
@@ -114,6 +114,16 @@ impl SortKeyComputer for SortByErasedType {
                            },
                        })
                    }
+                    ColumnType::Bytes => {
+                        let computer = SortByBytes::for_field(column_name);
+                        let inner = computer.segment_sort_key_computer(segment_reader)?;
+                        Box::new(ErasedSegmentSortKeyComputerWrapper {
+                            inner,
+                            converter: |val: Option<Vec<u8>>| {
+                                val.map(OwnedValue::Bytes).unwrap_or(OwnedValue::Null)
+                            },
+                        })
+                    }
                    ColumnType::U64 => {
                        let computer = SortByStaticFastValue::<u64>::for_field(column_name);
                        let inner = computer.segment_sort_key_computer(segment_reader)?;
@@ -281,6 +291,65 @@ mod tests {
        );
    }

+    #[test]
+    fn test_sort_by_owned_bytes() {
+        let mut schema_builder = Schema::builder();
+        let data_field = schema_builder.add_bytes_field("data", FAST);
+        let schema = schema_builder.build();
+        let index = Index::create_in_ram(schema);
+        let mut writer = index.writer_for_tests().unwrap();
+        writer
+            .add_document(doc!(data_field => vec![0x03u8, 0x00]))
+            .unwrap();
+        writer
+            .add_document(doc!(data_field => vec![0x01u8, 0x00]))
+            .unwrap();
+        writer
+            .add_document(doc!(data_field => vec![0x02u8, 0x00]))
+            .unwrap();
+        writer.add_document(doc!()).unwrap();
+        writer.commit().unwrap();
+
+        let reader = index.reader().unwrap();
+        let searcher = reader.searcher();
+
+        // Sort descending (Natural - highest first)
+        let collector = TopDocs::with_limit(10)
+            .order_by((SortByErasedType::for_field("data"), ComparatorEnum::Natural));
+        let top_docs = searcher.search(&AllQuery, &collector).unwrap();
+
+        let values: Vec<OwnedValue> = top_docs.into_iter().map(|(key, _)| key).collect();
+
+        assert_eq!(
+            values,
+            vec![
+                OwnedValue::Bytes(vec![0x03, 0x00]),
+                OwnedValue::Bytes(vec![0x02, 0x00]),
+                OwnedValue::Bytes(vec![0x01, 0x00]),
+                OwnedValue::Null
+            ]
+        );
+
+        // Sort ascending (ReverseNoneLower - lowest first, nulls last)
+        let collector = TopDocs::with_limit(10).order_by((
+            SortByErasedType::for_field("data"),
+            ComparatorEnum::ReverseNoneLower,
+        ));
+        let top_docs = searcher.search(&AllQuery, &collector).unwrap();
+
+        let values: Vec<OwnedValue> = top_docs.into_iter().map(|(key, _)| key).collect();
+
+        assert_eq!(
+            values,
+            vec![
+                OwnedValue::Bytes(vec![0x01, 0x00]),
+                OwnedValue::Bytes(vec![0x02, 0x00]),
+                OwnedValue::Bytes(vec![0x03, 0x00]),
+                OwnedValue::Null
+            ]
+        );
+    }
+
    #[test]
    fn test_sort_by_owned_reverse() {
        let mut schema_builder = Schema::builder();
--- a/src/directory/mmap_directory/mod.rs
+++ b/src/directory/mmap_directory/mod.rs
@@ -676,7 +676,7 @@ mod tests {
            let num_segments = reader.searcher().segment_readers().len();
            assert!(num_segments <= 4);
            let num_components_except_deletes_and_tempstore =
-                crate::index::SegmentComponent::iterator().len() - 2;
+                crate::index::SegmentComponent::iterator().len() - 1;
            let max_num_mmapped = num_components_except_deletes_and_tempstore * num_segments;
            assert_eventually(|| {
                let num_mmapped = mmap_directory.get_cache_info().mmapped.len();
--- a/src/directory/mod.rs
+++ b/src/directory/mod.rs
@@ -21,7 +21,7 @@ use std::path::PathBuf;
 pub use common::file_slice::{FileHandle, FileSlice};
 pub use common::{AntiCallToken, OwnedBytes, TerminatingWrite};

-pub(crate) use self::composite_file::{CompositeFile, CompositeWrite};
+pub use self::composite_file::{CompositeFile, CompositeWrite};
 pub use self::directory::{Directory, DirectoryClone, DirectoryLock};
 pub use self::directory_lock::{Lock, INDEX_WRITER_LOCK, META_LOCK};
 pub use self::ram_directory::RamDirectory;
@@ -52,7 +52,7 @@ pub use self::mmap_directory::MmapDirectory;
 ///
 /// `WritePtr` are required to implement both Write
 /// and Seek.
-pub type WritePtr = BufWriter<Box<dyn TerminatingWrite>>;
+pub type WritePtr = BufWriter<Box<dyn TerminatingWrite + Send + Sync>>;

 #[cfg(test)]
 mod tests;
--- a/src/docset.rs
+++ b/src/docset.rs
@@ -65,8 +65,8 @@ pub trait DocSet: Send {
    ///   `seek_danger(..)` until it returns `Found`, and get back to a valid state.
    ///
    /// `seek_lower_bound` can be any `DocId` (in the docset or not) as long as it is in
-    /// `(target .. seek_result]` where `seek_result` is the first document in the docset greater
-    /// than to `target`.
+    /// `(target .. seek_result] U {TERMINATED}` where `seek_result` is the first document in the
+    /// docset greater than to `target`.
    ///
    /// `seek_danger` may return `SeekLowerBound(TERMINATED)`.
    ///
@@ -98,7 +98,7 @@ pub trait DocSet: Send {
        if doc == target {
            SeekDangerResult::Found
        } else {
-            SeekDangerResult::SeekLowerBound(self.doc())
+            SeekDangerResult::SeekLowerBound(doc)
        }
    }

--- a/src/index/index_meta.rs
+++ b/src/index/index_meta.rs
@@ -1,8 +1,6 @@
 use std::collections::HashSet;
 use std::fmt;
 use std::path::PathBuf;
-use std::sync::atomic::AtomicBool;
-use std::sync::Arc;

 use serde::{Deserialize, Serialize};

@@ -37,7 +35,6 @@ impl SegmentMetaInventory {
        let inner = InnerSegmentMeta {
            segment_id,
            max_doc,
-            include_temp_doc_store: Arc::new(AtomicBool::new(true)),
            deletes: None,
        };
        SegmentMeta::from(self.inventory.track(inner))
@@ -85,15 +82,6 @@ impl SegmentMeta {
        self.tracked.segment_id
    }

-    /// Removes the Component::TempStore from the alive list and
-    /// therefore marks the temp docstore file to be deleted by
-    /// the garbage collection.
-    pub fn untrack_temp_docstore(&self) {
-        self.tracked
-            .include_temp_doc_store
-            .store(false, std::sync::atomic::Ordering::Relaxed);
-    }
-
    /// Returns the number of deleted documents.
    pub fn num_deleted_docs(&self) -> u32 {
        self.tracked
@@ -111,20 +99,9 @@ impl SegmentMeta {
    /// is by removing all files that have been created by tantivy
    /// and are not used by any segment anymore.
    pub fn list_files(&self) -> HashSet<PathBuf> {
-        if self
-            .tracked
-            .include_temp_doc_store
-            .load(std::sync::atomic::Ordering::Relaxed)
-        {
-            SegmentComponent::iterator()
-                .map(|component| self.relative_path(*component))
-                .collect::<HashSet<PathBuf>>()
-        } else {
-            SegmentComponent::iterator()
-                .filter(|comp| *comp != &SegmentComponent::TempStore)
-                .map(|component| self.relative_path(*component))
-                .collect::<HashSet<PathBuf>>()
-        }
+        SegmentComponent::iterator()
+            .map(|component| self.relative_path(*component))
+            .collect::<HashSet<PathBuf>>()
    }

    /// Returns the relative path of a component of our segment.
@@ -138,7 +115,6 @@ impl SegmentMeta {
            SegmentComponent::Positions => ".pos".to_string(),
            SegmentComponent::Terms => ".term".to_string(),
            SegmentComponent::Store => ".store".to_string(),
-            SegmentComponent::TempStore => ".store.temp".to_string(),
            SegmentComponent::FastFields => ".fast".to_string(),
            SegmentComponent::FieldNorms => ".fieldnorm".to_string(),
            SegmentComponent::Delete => format!(".{}.del", self.delete_opstamp().unwrap_or(0)),
@@ -183,7 +159,6 @@ impl SegmentMeta {
            segment_id: inner_meta.segment_id,
            max_doc,
            deletes: None,
-            include_temp_doc_store: Arc::new(AtomicBool::new(true)),
        });
        SegmentMeta { tracked }
    }
@@ -202,7 +177,6 @@ impl SegmentMeta {
        let tracked = self.tracked.map(move |inner_meta| InnerSegmentMeta {
            segment_id: inner_meta.segment_id,
            max_doc: inner_meta.max_doc,
-            include_temp_doc_store: Arc::new(AtomicBool::new(true)),
            deletes: Some(delete_meta),
        });
        SegmentMeta { tracked }
@@ -214,14 +188,6 @@ struct InnerSegmentMeta {
    segment_id: SegmentId,
    max_doc: u32,
    pub deletes: Option<DeleteMeta>,
-    /// If you want to avoid the SegmentComponent::TempStore file to be covered by
-    /// garbage collection and deleted, set this to true. This is used during merge.
-    #[serde(skip)]
-    #[serde(default = "default_temp_store")]
-    pub(crate) include_temp_doc_store: Arc<AtomicBool>,
-}
-fn default_temp_store() -> Arc<AtomicBool> {
-    Arc::new(AtomicBool::new(false))
 }

 impl InnerSegmentMeta {
--- a/src/index/segment_component.rs
+++ b/src/index/segment_component.rs
@@ -23,8 +23,6 @@ pub enum SegmentComponent {
    /// Accessing a document from the store is relatively slow, as it
    /// requires to decompress the entire block it belongs to.
    Store,
-    /// Temporary storage of the documents, before streamed to `Store`.
-    TempStore,
    /// Bitset describing which document of the segment is alive.
    /// (It was representing deleted docs but changed to represent alive docs from v0.17)
    Delete,
@@ -33,14 +31,13 @@ pub enum SegmentComponent {
 impl SegmentComponent {
    /// Iterates through the components.
    pub fn iterator() -> slice::Iter<'static, SegmentComponent> {
-        static SEGMENT_COMPONENTS: [SegmentComponent; 8] = [
+        static SEGMENT_COMPONENTS: [SegmentComponent; 7] = [
            SegmentComponent::Postings,
            SegmentComponent::Positions,
            SegmentComponent::FastFields,
            SegmentComponent::FieldNorms,
            SegmentComponent::Terms,
            SegmentComponent::Store,
-            SegmentComponent::TempStore,
            SegmentComponent::Delete,
        ];
        SEGMENT_COMPONENTS.iter()
--- a/src/indexer/index_writer.rs
+++ b/src/indexer/index_writer.rs
@@ -218,7 +218,7 @@ fn index_documents<D: Document>(
    let alive_bitset_opt = apply_deletes(&segment_with_max_doc, &mut delete_cursor, &doc_opstamps)?;

    let meta = segment_with_max_doc.meta().clone();
-    meta.untrack_temp_docstore();
+
    // update segment_updater inventory to remove tempstore
    let segment_entry = SegmentEntry::new(meta, delete_cursor, alive_bitset_opt);
    segment_updater.schedule_add_segment(segment_entry).wait()?;
--- a/src/indexer/log_merge_policy.rs
+++ b/src/indexer/log_merge_policy.rs
@@ -94,7 +94,7 @@ impl MergePolicy for LogMergePolicy {
    fn compute_merge_candidates(&self, segments: &[SegmentMeta]) -> Vec<MergeCandidate> {
        let size_sorted_segments = segments
            .iter()
-            .filter(|seg| seg.num_docs() <= (self.max_docs_before_merge as u32))
+            .filter(|seg| (seg.num_docs() as usize) <= self.max_docs_before_merge)
            .sorted_by_key(|seg| std::cmp::Reverse(seg.max_doc()))
            .collect::<Vec<&SegmentMeta>>();

@@ -372,4 +372,21 @@ mod tests {
        assert_eq!(merge_candidates[0].0.len(), 1);
        assert_eq!(merge_candidates[0].0[0], test_input[1].id());
    }
+
+    #[test]
+    fn test_max_docs_before_merge_large_value() {
+        // Regression test: (max_docs_before_merge as u32) truncates values > u32::MAX.
+        // Casting num_docs() to usize instead avoids the truncation.
+        let mut policy = LogMergePolicy::default();
+        policy.set_min_num_segments(2);
+        policy.set_max_docs_before_merge(5_000_000_000usize);
+        let test_input = vec![
+            create_random_segment_meta(100_000),
+            create_random_segment_meta(100_000),
+        ];
+        let result = policy.compute_merge_candidates(&test_input);
+        // Both segments should be eligible (100_000 < 5_000_000_000)
+        assert_eq!(result.len(), 1);
+        assert_eq!(result[0].0.len(), 2);
+    }
 }
--- a/src/indexer/segment_updater.rs
+++ b/src/indexer/segment_updater.rs
@@ -403,7 +403,8 @@ impl SegmentUpdater {
            // from the different drives.
            //
            // Segment 1 from disk 1, Segment 1 from disk 2, etc.
-            committed_segment_metas.sort_by_key(|segment_meta| -(segment_meta.max_doc() as i32));
+            committed_segment_metas
+                .sort_by_key(|segment_meta| std::cmp::Reverse(segment_meta.max_doc()));
            let index_meta = IndexMeta {
                index_settings: index.settings().clone(),
                segments: committed_segment_metas,
@@ -705,6 +706,7 @@ mod tests {
    use crate::collector::TopDocs;
    use crate::directory::RamDirectory;
    use crate::fastfield::AliveBitSet;
+    use crate::index::{SegmentId, SegmentMetaInventory};
    use crate::indexer::merge_policy::tests::MergeWheneverPossible;
    use crate::indexer::merger::IndexMerger;
    use crate::indexer::segment_updater::merge_filtered_segments;
@@ -712,6 +714,22 @@ mod tests {
    use crate::schema::*;
    use crate::{Directory, DocAddress, Index, Segment};

+    #[test]
+    fn test_segment_sort_large_max_doc() {
+        // Regression test: -(max_doc as i32) overflows for max_doc >= 2^31.
+        // Using std::cmp::Reverse avoids this.
+        let inventory = SegmentMetaInventory::default();
+        let mut metas = vec![
+            inventory.new_segment_meta(SegmentId::generate_random(), 100),
+            inventory.new_segment_meta(SegmentId::generate_random(), (1u32 << 31) - 1),
+            inventory.new_segment_meta(SegmentId::generate_random(), 50_000),
+        ];
+        metas.sort_by_key(|m| std::cmp::Reverse(m.max_doc()));
+        assert_eq!(metas[0].max_doc(), (1u32 << 31) - 1);
+        assert_eq!(metas[1].max_doc(), 50_000);
+        assert_eq!(metas[2].max_doc(), 100);
+    }
+
    #[test]
    fn test_delete_during_merge() -> crate::Result<()> {
        let mut schema_builder = Schema::builder();
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -169,8 +169,10 @@ mod macros;
 mod future_result;

 // Re-exports
+pub use columnar;
 pub use common::{ByteCount, DateTime};
-pub use {columnar, query_grammar, time};
+pub use query_grammar;
+pub use time;

 pub use crate::error::TantivyError;
 pub use crate::future_result::FutureResult;
--- a/src/postings/block_segment_postings.rs
+++ b/src/postings/block_segment_postings.rs
@@ -303,10 +303,10 @@ impl BlockSegmentPostings {
    }

    pub(crate) fn load_block(&mut self) {
-        let offset = self.skip_reader.byte_offset();
        if self.block_is_loaded() {
            return;
        }
+        let offset = self.skip_reader.byte_offset();
        match self.skip_reader.block_info() {
            BlockInfo::BitPacked {
                doc_num_bits,
--- a/src/postings/mod.rs
+++ b/src/postings/mod.rs
@@ -14,7 +14,8 @@ mod postings;
 mod postings_writer;
 mod recorder;
 mod segment_postings;
-mod serializer;
+/// Serializer module for the inverted index
+pub mod serializer;
 mod skip;
 mod term_info;

--- a/src/postings/segment_postings.rs
+++ b/src/postings/segment_postings.rs
@@ -168,12 +168,20 @@ impl DocSet for SegmentPostings {
        self.doc()
    }

+    #[inline]
    fn seek(&mut self, target: DocId) -> DocId {
        debug_assert!(self.doc() <= target);
        if self.doc() >= target {
            return self.doc();
        }

+        // As an optimization, if the block is already loaded, we can
+        // cheaply check the next doc.
+        self.cur = (self.cur + 1).min(COMPRESSION_BLOCK_SIZE - 1);
+        if self.doc() >= target {
+            return self.doc();
+        }
+
        // Delegate block-local search to BlockSegmentPostings::seek, which returns
        // the in-block index of the first doc >= target.
        self.cur = self.block_cursor.seek(target);
--- a/src/postings/serializer.rs
+++ b/src/postings/serializer.rs
@@ -11,7 +11,7 @@ use crate::positions::PositionSerializer;
 use crate::postings::compression::{BlockEncoder, VIntEncoder, COMPRESSION_BLOCK_SIZE};
 use crate::postings::skip::SkipSerializer;
 use crate::query::Bm25Weight;
-use crate::schema::{Field, FieldEntry, FieldType, IndexRecordOption, Schema};
+use crate::schema::{Field, FieldEntry, IndexRecordOption, Schema};
 use crate::termdict::TermDictionaryBuilder;
 use crate::{DocId, Score};

@@ -80,9 +80,12 @@ impl InvertedIndexSerializer {
        let term_dictionary_write = self.terms_write.for_field(field);
        let postings_write = self.postings_write.for_field(field);
        let positions_write = self.positions_write.for_field(field);
-        let field_type: FieldType = (*field_entry.field_type()).clone();
+        let index_record_option = field_entry
+            .field_type()
+            .index_record_option()
+            .unwrap_or(IndexRecordOption::Basic);
        FieldSerializer::create(
-            &field_type,
+            index_record_option,
            total_num_tokens,
            term_dictionary_write,
            postings_write,
@@ -102,29 +105,27 @@ impl InvertedIndexSerializer {

 /// The field serializer is in charge of
 /// the serialization of a specific field.
-pub struct FieldSerializer<'a> {
-    term_dictionary_builder: TermDictionaryBuilder<&'a mut CountingWriter<WritePtr>>,
+pub struct FieldSerializer<'a, W: Write = WritePtr> {
+    term_dictionary_builder: TermDictionaryBuilder<&'a mut CountingWriter<W>>,
    postings_serializer: PostingsSerializer,
-    positions_serializer_opt: Option<PositionSerializer<&'a mut CountingWriter<WritePtr>>>,
+    positions_serializer_opt: Option<PositionSerializer<&'a mut CountingWriter<W>>>,
    current_term_info: TermInfo,
    term_open: bool,
-    postings_write: &'a mut CountingWriter<WritePtr>,
+    postings_write: &'a mut CountingWriter<W>,
    postings_start_offset: u64,
 }

-impl<'a> FieldSerializer<'a> {
-    fn create(
-        field_type: &FieldType,
+impl<'a, W: Write> FieldSerializer<'a, W> {
+    /// Creates a new `FieldSerializer` for the given field type.
+    pub fn create(
+        index_record_option: IndexRecordOption,
        total_num_tokens: u64,
-        term_dictionary_write: &'a mut CountingWriter<WritePtr>,
-        postings_write: &'a mut CountingWriter<WritePtr>,
-        positions_write: &'a mut CountingWriter<WritePtr>,
+        term_dictionary_write: &'a mut CountingWriter<W>,
+        postings_write: &'a mut CountingWriter<W>,
+        positions_write: &'a mut CountingWriter<W>,
        fieldnorm_reader: Option<FieldNormReader>,
-    ) -> io::Result<FieldSerializer<'a>> {
+    ) -> io::Result<FieldSerializer<'a, W>> {
        total_num_tokens.serialize(postings_write)?;
-        let index_record_option = field_type
-            .index_record_option()
-            .unwrap_or(IndexRecordOption::Basic);
        let term_dictionary_builder = TermDictionaryBuilder::create(term_dictionary_write)?;
        let average_fieldnorm = fieldnorm_reader
            .as_ref()
@@ -192,6 +193,11 @@ impl<'a> FieldSerializer<'a> {
        Ok(())
    }

+    /// Starts the postings for a new term without recording term frequencies.
+    pub fn new_term_without_freq(&mut self, term: &[u8]) -> io::Result<()> {
+        self.new_term(term, 0, false)
+    }
+
    /// Serialize the information that a document contains for the current term:
    /// its term frequency, and the position deltas.
    ///
@@ -297,6 +303,7 @@ impl Block {
    }
 }

+/// Serializer for postings lists.
 pub struct PostingsSerializer {
    last_doc_id_encoded: u32,

@@ -316,6 +323,9 @@ pub struct PostingsSerializer {
 }

 impl PostingsSerializer {
+    /// Creates a new `PostingsSerializer`.
+    /// * avg_fieldnorm - average field norm for the field being serialized.
+    /// * mode - indexing options for the field being serialized.
    pub fn new(
        avg_fieldnorm: Score,
        mode: IndexRecordOption,
@@ -338,6 +348,8 @@ impl PostingsSerializer {
        }
    }

+    /// Starts the serialization for a new term.
+    /// * term_doc_freq - the number of documents containing the term.
    pub fn new_term(&mut self, term_doc_freq: u32, record_term_freq: bool) {
        self.bm25_weight = None;

@@ -377,6 +389,7 @@ impl PostingsSerializer {
            self.postings_write.extend(block_encoded);
        }
        if self.term_has_freq {
+            // encode the term frequencies
            let (num_bits, block_encoded): (u8, &[u8]) = self
                .block_encoder
                .compress_block_unsorted(self.block.term_freqs(), true);
@@ -417,6 +430,9 @@ impl PostingsSerializer {
        self.block.clear();
    }

+    /// Register that the given document contains the current term.
+    /// * doc_id - the document id.
+    /// * term_freq - the term frequency within the document.
    pub fn write_doc(&mut self, doc_id: DocId, term_freq: u32) {
        self.block.append_doc(doc_id, term_freq);
        if self.block.is_full() {
@@ -424,6 +440,7 @@ impl PostingsSerializer {
        }
    }

+    /// Finish the serialization for this term.
    pub fn close_term(
        &mut self,
        doc_freq: u32,
--- a/src/postings/skip.rs
+++ b/src/postings/skip.rs
@@ -14,7 +14,11 @@ use crate::{DocId, Score, TERMINATED};
 //   (requiring a 6th bit), but the biggest doc_id we can want to encode is TERMINATED-1, which can
 //   be represented on 31b without delta encoding.
 fn encode_bitwidth(bitwidth: u8, delta_1: bool) -> u8 {
-    assert!(bitwidth < 32);
+    assert!(
+        bitwidth < 32,
+        "bitwidth needs to be less than 32, but got {}",
+        bitwidth
+    );
    bitwidth | ((delta_1 as u8) << 6)
 }

--- a/src/query/boolean_query/boolean_weight.rs
+++ b/src/query/boolean_query/boolean_weight.rs
@@ -291,18 +291,6 @@ impl<TScoreCombiner: ScoreCombiner> BooleanWeight<TScoreCombiner> {
            }
        };

-        let exclude_scorer_opt: Option<Box<dyn Scorer>> = if exclude_scorers.is_empty() {
-            None
-        } else {
-            let exclude_specialized_scorer: SpecializedScorer =
-                scorer_union(exclude_scorers, DoNothingCombiner::default, num_docs);
-            Some(into_box_scorer(
-                exclude_specialized_scorer,
-                DoNothingCombiner::default,
-                num_docs,
-            ))
-        };
-
        let include_scorer = match (should_scorers, must_scorers) {
            (ShouldScorersCombinationMethod::Ignored, must_scorers) => {
                // No SHOULD clauses (or they were absorbed into MUST).
@@ -380,16 +368,23 @@ impl<TScoreCombiner: ScoreCombiner> BooleanWeight<TScoreCombiner> {
                }
            }
        };
-        if let Some(exclude_scorer) = exclude_scorer_opt {
-            let include_scorer_boxed =
-                into_box_scorer(include_scorer, &score_combiner_fn, num_docs);
-            Ok(SpecializedScorer::Other(Box::new(Exclude::new(
-                include_scorer_boxed,
-                exclude_scorer,
-            ))))
-        } else {
-            Ok(include_scorer)
+        if exclude_scorers.is_empty() {
+            return Ok(include_scorer);
        }
+
+        let include_scorer_boxed = into_box_scorer(include_scorer, &score_combiner_fn, num_docs);
+        let scorer: Box<dyn Scorer> = if exclude_scorers.len() == 1 {
+            let exclude_scorer = exclude_scorers.pop().unwrap();
+            match exclude_scorer.downcast::<TermScorer>() {
+                // Cast to TermScorer succeeded
+                Ok(exclude_scorer) => Box::new(Exclude::new(include_scorer_boxed, *exclude_scorer)),
+                // We get back the original Box<dyn Scorer>
+                Err(exclude_scorer) => Box::new(Exclude::new(include_scorer_boxed, exclude_scorer)),
+            }
+        } else {
+            Box::new(Exclude::new(include_scorer_boxed, exclude_scorers))
+        };
+        Ok(SpecializedScorer::Other(scorer))
    }
 }

--- a/src/query/exclude.rs
+++ b/src/query/exclude.rs
@@ -1,48 +1,71 @@
-use crate::docset::{DocSet, TERMINATED};
+use crate::docset::{DocSet, SeekDangerResult, TERMINATED};
 use crate::query::Scorer;
 use crate::{DocId, Score};

-#[inline]
-fn is_within<TDocSetExclude: DocSet>(docset: &mut TDocSetExclude, doc: DocId) -> bool {
-    docset.doc() <= doc && docset.seek(doc) == doc
-}
-
-/// Filters a given `DocSet` by removing the docs from a given `DocSet`.
+/// An exclusion set is a set of documents
+/// that should be excluded from a given DocSet.
 ///
-/// The excluding docset has no impact on scoring.
-pub struct Exclude<TDocSet, TDocSetExclude> {
-    underlying_docset: TDocSet,
-    excluding_docset: TDocSetExclude,
+/// It can be a single DocSet, or a Vec of DocSets.
+pub trait ExclusionSet: Send {
+    /// Returns `true` if the given `doc` is in the exclusion set.
+    fn contains(&mut self, doc: DocId) -> bool;
 }

-impl<TDocSet, TDocSetExclude> Exclude<TDocSet, TDocSetExclude>
+impl<TDocSet: DocSet> ExclusionSet for TDocSet {
+    #[inline]
+    fn contains(&mut self, doc: DocId) -> bool {
+        self.seek_danger(doc) == SeekDangerResult::Found
+    }
+}
+
+impl<TDocSet: DocSet> ExclusionSet for Vec<TDocSet> {
+    #[inline]
+    fn contains(&mut self, doc: DocId) -> bool {
+        for docset in self.iter_mut() {
+            if docset.seek_danger(doc) == SeekDangerResult::Found {
+                return true;
+            }
+        }
+        false
+    }
+}
+
+/// Filters a given `DocSet` by removing the docs from an exclusion set.
+///
+/// The excluding docsets have no impact on scoring.
+pub struct Exclude<TDocSet, TExclusionSet> {
+    underlying_docset: TDocSet,
+    exclusion_set: TExclusionSet,
+}
+
+impl<TDocSet, TExclusionSet> Exclude<TDocSet, TExclusionSet>
 where
    TDocSet: DocSet,
-    TDocSetExclude: DocSet,
+    TExclusionSet: ExclusionSet,
 {
    /// Creates a new `ExcludeScorer`
    pub fn new(
        mut underlying_docset: TDocSet,
-        mut excluding_docset: TDocSetExclude,
-    ) -> Exclude<TDocSet, TDocSetExclude> {
+        mut exclusion_set: TExclusionSet,
+    ) -> Exclude<TDocSet, TExclusionSet> {
        while underlying_docset.doc() != TERMINATED {
            let target = underlying_docset.doc();
-            if !is_within(&mut excluding_docset, target) {
+            if !exclusion_set.contains(target) {
                break;
            }
            underlying_docset.advance();
        }
        Exclude {
            underlying_docset,
-            excluding_docset,
+            exclusion_set,
        }
    }
 }

-impl<TDocSet, TDocSetExclude> DocSet for Exclude<TDocSet, TDocSetExclude>
+impl<TDocSet, TExclusionSet> DocSet for Exclude<TDocSet, TExclusionSet>
 where
    TDocSet: DocSet,
-    TDocSetExclude: DocSet,
+    TExclusionSet: ExclusionSet,
 {
    fn advance(&mut self) -> DocId {
        loop {
@@ -50,7 +73,7 @@ where
            if candidate == TERMINATED {
                return TERMINATED;
            }
-            if !is_within(&mut self.excluding_docset, candidate) {
+            if !self.exclusion_set.contains(candidate) {
                return candidate;
            }
        }
@@ -61,7 +84,7 @@ where
        if candidate == TERMINATED {
            return TERMINATED;
        }
-        if !is_within(&mut self.excluding_docset, candidate) {
+        if !self.exclusion_set.contains(candidate) {
            return candidate;
        }
        self.advance()
@@ -79,10 +102,10 @@ where
    }
 }

-impl<TScorer, TDocSetExclude> Scorer for Exclude<TScorer, TDocSetExclude>
+impl<TScorer, TExclusionSet> Scorer for Exclude<TScorer, TExclusionSet>
 where
    TScorer: Scorer,
-    TDocSetExclude: DocSet + 'static,
+    TExclusionSet: ExclusionSet + 'static,
 {
    #[inline]
    fn score(&mut self) -> Score {
--- a/src/query/intersection.rs
+++ b/src/query/intersection.rs
@@ -84,6 +84,14 @@ impl<TDocSet: DocSet> Intersection<TDocSet, TDocSet> {
        docsets.sort_by_key(|docset| docset.cost());
        go_to_first_doc(&mut docsets);
        let left = docsets.remove(0);
+        debug_assert!({
+            let doc = left.doc();
+            if doc == TERMINATED {
+                true
+            } else {
+                docsets.iter().all(|docset| docset.doc() == doc)
+            }
+        });
        let right = docsets.remove(0);
        Intersection {
            left,
@@ -112,30 +120,24 @@ impl<TDocSet: DocSet, TOtherDocSet: DocSet> DocSet for Intersection<TDocSet, TOt
        // Invariant:
        // - candidate is always <= to the next document in the intersection.
        // - candidate strictly increases at every occurence of the loop.
-        let mut candidate = 0;
+        let mut candidate = left.doc() + 1;

        // Termination: candidate strictly increases.
        'outer: while candidate < TERMINATED {
            // As we enter the loop, we should always have candidate < next_doc.

-            // This step always increases candidate.
-            //
-            // TODO: Think about which value would make sense here
-            // It depends on the DocSet implementation, when a seek would outweigh an advance.
-            candidate = if candidate > left.doc().wrapping_add(100) {
-                left.seek(candidate)
-            } else {
-                left.advance()
-            };
+            candidate = left.seek(candidate);

            // Left is positionned on `candidate`.
            debug_assert_eq!(left.doc(), candidate);

            if let SeekDangerResult::SeekLowerBound(seek_lower_bound) = right.seek_danger(candidate)
            {
-                // The max is technically useless but it makes the invariant
-                // easier to proofread.
-                debug_assert!(seek_lower_bound >= candidate);
+                debug_assert!(
+                    seek_lower_bound == TERMINATED || seek_lower_bound > candidate,
+                    "seek_lower_bound {seek_lower_bound} must be greater than candidate \
+                     {candidate}"
+                );
                candidate = seek_lower_bound;
                continue;
            }
@@ -148,7 +150,11 @@ impl<TDocSet: DocSet, TOtherDocSet: DocSet> DocSet for Intersection<TDocSet, TOt
                    other.seek_danger(candidate)
                {
                    // One of the scorer does not match, let's restart at the top of the loop.
-                    debug_assert!(seek_lower_bound >= candidate);
+                    debug_assert!(
+                        seek_lower_bound == TERMINATED || seek_lower_bound > candidate,
+                        "seek_lower_bound {seek_lower_bound} must be greater than candidate \
+                         {candidate}"
+                    );
                    candidate = seek_lower_bound;
                    continue 'outer;
                }
@@ -238,9 +244,12 @@ mod tests {
    use proptest::prelude::*;

    use super::Intersection;
+    use crate::collector::Count;
    use crate::docset::{DocSet, TERMINATED};
    use crate::postings::tests::test_skip_against_unoptimized;
-    use crate::query::VecDocSet;
+    use crate::query::{QueryParser, VecDocSet};
+    use crate::schema::{Schema, TEXT};
+    use crate::Index;

    #[test]
    fn test_intersection() {
@@ -411,4 +420,29 @@ mod tests {
            assert_eq!(intersection.doc(), TERMINATED);
        }
    }
+
+    #[test]
+    fn test_bug_2811_intersection_candidate_should_increase() {
+        let mut schema_builder = Schema::builder();
+        let text_field = schema_builder.add_text_field("text", TEXT);
+        let schema = schema_builder.build();
+
+        let index = Index::create_in_ram(schema);
+        let mut writer = index.writer_for_tests().unwrap();
+        writer
+            .add_document(doc!(text_field=>"hello happy tax"))
+            .unwrap();
+        writer.add_document(doc!(text_field=>"hello")).unwrap();
+        writer.add_document(doc!(text_field=>"hello")).unwrap();
+        writer.add_document(doc!(text_field=>"happy tax")).unwrap();
+
+        writer.commit().unwrap();
+        let query_parser = QueryParser::for_index(&index, Vec::new());
+        let query = query_parser
+            .parse_query(r#"+text:hello +text:"happy tax""#)
+            .unwrap();
+        let searcher = index.reader().unwrap().searcher();
+        let c = searcher.search(&*query, &Count).unwrap();
+        assert_eq!(c, 1);
+    }
 }
--- a/src/query/mod.rs
+++ b/src/query/mod.rs
@@ -43,7 +43,7 @@ pub use self::boost_query::{BoostQuery, BoostWeight};
 pub use self::const_score_query::{ConstScoreQuery, ConstScorer};
 pub use self::disjunction_max_query::DisjunctionMaxQuery;
 pub use self::empty_query::{EmptyQuery, EmptyScorer, EmptyWeight};
-pub use self::exclude::Exclude;
+pub use self::exclude::{Exclude, ExclusionSet};
 pub use self::exist_query::ExistsQuery;
 pub use self::explanation::Explanation;
 #[cfg(test)]
--- a/src/query/phrase_query/phrase_scorer.rs
+++ b/src/query/phrase_query/phrase_scorer.rs
@@ -531,7 +531,12 @@ impl<TPostings: Postings> DocSet for PhraseScorer<TPostings> {
    }

    fn seek_danger(&mut self, target: DocId) -> SeekDangerResult {
-        debug_assert!(target >= self.doc());
+        debug_assert!(
+            target >= self.doc(),
+            "target ({}) should be greater than or equal to doc ({})",
+            target,
+            self.doc()
+        );
        let seek_res = self.intersection_docset.seek_danger(target);
        if seek_res != SeekDangerResult::Found {
            return seek_res;
--- a/src/query/query_parser/query_parser.rs
+++ b/src/query/query_parser/query_parser.rs
@@ -2068,6 +2068,16 @@ mod test {
            format!("Regex(Field(0), {:#?})", expected_regex).as_str(),
            false,
        );
+        let expected_regex2 = tantivy_fst::Regex::new(r".*a").unwrap();
+        test_parse_query_to_logical_ast_helper(
+            "title:(/.*b/ OR /.*a/)",
+            format!(
+                "(Regex(Field(0), {:#?}) Regex(Field(0), {:#?}))",
+                expected_regex, expected_regex2
+            )
+            .as_str(),
+            false,
+        );

        // Invalid field
        let err = parse_query_to_logical_ast("float:/.*b/", false).unwrap_err();
--- a/src/query/range_query/mod.rs
+++ b/src/query/range_query/mod.rs
@@ -19,7 +19,8 @@ pub(crate) fn is_type_valid_for_fastfield_range_query(typ: Type) -> bool {
        | Type::Bool
        | Type::Date
        | Type::Json
-        | Type::IpAddr => true,
-        Type::Facet | Type::Bytes => false,
+        | Type::IpAddr
+        | Type::Bytes => true,
+        Type::Facet => false,
    }
 }
--- a/src/query/range_query/range_query_fastfield.rs
+++ b/src/query/range_query/range_query_fastfield.rs
@@ -6,8 +6,8 @@ use std::net::Ipv6Addr;
 use std::ops::{Bound, RangeInclusive};

 use columnar::{
-    Cardinality, Column, ColumnType, MonotonicallyMappableToU128, MonotonicallyMappableToU64,
-    NumericalType, StrColumn,
+    BytesColumn, Cardinality, Column, ColumnType, MonotonicallyMappableToU128,
+    MonotonicallyMappableToU64, NumericalType, StrColumn,
 };
 use common::bounds::{BoundsRange, TransformBound};

@@ -163,6 +163,25 @@ impl Weight for FastFieldRangeWeight {
            };
            let dict = str_dict_column.dictionary();

+            let bounds = self.bounds.map_bound(get_value_bytes);
+            // Get term ids for terms
+            let (lower_bound, upper_bound) =
+                dict.term_bounds_to_ord(bounds.lower_bound, bounds.upper_bound)?;
+            let fast_field_reader = reader.fast_fields();
+            let Some((column, _col_type)) =
+                fast_field_reader.u64_lenient_for_type(None, &field_name)?
+            else {
+                return Ok(Box::new(EmptyScorer));
+            };
+            search_on_u64_ff(column, boost, BoundsRange::new(lower_bound, upper_bound))
+        } else if field_type.is_bytes() {
+            let Some(bytes_column): Option<BytesColumn> =
+                reader.fast_fields().bytes(&field_name)?
+            else {
+                return Ok(Box::new(EmptyScorer));
+            };
+            let dict = bytes_column.dictionary();
+
            let bounds = self.bounds.map_bound(get_value_bytes);
            // Get term ids for terms
            let (lower_bound, upper_bound) =
@@ -1402,6 +1421,66 @@ mod tests {

        Ok(())
    }
+
+    #[test]
+    fn test_bytes_field_ff_range_query() -> crate::Result<()> {
+        use crate::schema::BytesOptions;
+
+        let mut schema_builder = Schema::builder();
+        let bytes_field = schema_builder
+            .add_bytes_field("data", BytesOptions::default().set_fast().set_indexed());
+        let schema = schema_builder.build();
+        let index = Index::create_in_ram(schema.clone());
+        let mut index_writer: IndexWriter = index.writer_for_tests()?;
+
+        // Insert documents with lexicographically sortable byte values
+        // Using simple byte sequences that have clear ordering
+        let values: Vec<Vec<u8>> = vec![
+            vec![0x00, 0x10],
+            vec![0x00, 0x20],
+            vec![0x00, 0x30],
+            vec![0x01, 0x00],
+            vec![0x01, 0x10],
+            vec![0x02, 0x00],
+        ];
+
+        for value in &values {
+            let mut doc = TantivyDocument::new();
+            doc.add_bytes(bytes_field, value);
+            index_writer.add_document(doc)?;
+        }
+        index_writer.commit()?;
+
+        let reader = index.reader()?;
+        let searcher = reader.searcher();
+
+        // Test: Range query [0x00, 0x20] to [0x01, 0x00] (inclusive)
+        // Should match: [0x00, 0x20], [0x00, 0x30], [0x01, 0x00]
+        let lower = Term::from_field_bytes(bytes_field, &[0x00, 0x20]);
+        let upper = Term::from_field_bytes(bytes_field, &[0x01, 0x00]);
+        let range_query = RangeQuery::new(Bound::Included(lower), Bound::Included(upper));
+        let count = searcher.search(&range_query, &Count)?;
+        assert_eq!(
+            count, 3,
+            "Expected 3 documents in range [0x00,0x20] to [0x01,0x00]"
+        );
+
+        // Test: Range query > [0x01, 0x00] (exclusive lower bound)
+        // Should match: [0x01, 0x10], [0x02, 0x00]
+        let lower = Term::from_field_bytes(bytes_field, &[0x01, 0x00]);
+        let range_query = RangeQuery::new(Bound::Excluded(lower), Bound::Unbounded);
+        let count = searcher.search(&range_query, &Count)?;
+        assert_eq!(count, 2, "Expected 2 documents > [0x01,0x00]");
+
+        // Test: Range query < [0x00, 0x30] (exclusive upper bound)
+        // Should match: [0x00, 0x10], [0x00, 0x20]
+        let upper = Term::from_field_bytes(bytes_field, &[0x00, 0x30]);
+        let range_query = RangeQuery::new(Bound::Unbounded, Bound::Excluded(upper));
+        let count = searcher.search(&range_query, &Count)?;
+        assert_eq!(count, 2, "Expected 2 documents < [0x00,0x30]");
+
+        Ok(())
+    }
 }

 #[cfg(test)]
--- a/src/query/term_query/term_scorer.rs
+++ b/src/query/term_query/term_scorer.rs
@@ -105,6 +105,7 @@ impl DocSet for TermScorer {

    #[inline]
    fn seek(&mut self, target: DocId) -> DocId {
+        debug_assert!(target >= self.doc());
        self.postings.seek(target)
    }

--- a/src/schema/field_type.rs
+++ b/src/schema/field_type.rs
@@ -223,6 +223,11 @@ impl FieldType {
        matches!(self, FieldType::Str(_))
    }

+    /// returns true if this is a bytes field
+    pub fn is_bytes(&self) -> bool {
+        matches!(self, FieldType::Bytes(_))
+    }
+
    /// returns true if this is an date field
    pub fn is_date(&self) -> bool {
        matches!(self, FieldType::Date(_))
--- a/src/space_usage/mod.rs
+++ b/src/space_usage/mod.rs
@@ -124,7 +124,6 @@ impl SegmentSpaceUsage {
            FieldNorms => PerField(self.fieldnorms().clone()),
            Terms => PerField(self.termdict().clone()),
            SegmentComponent::Store => ComponentSpaceUsage::Store(self.store().clone()),
-            SegmentComponent::TempStore => ComponentSpaceUsage::Store(self.store().clone()),
            Delete => Basic(self.deletes()),
        }
    }
--- a/sstable/src/lib.rs
+++ b/sstable/src/lib.rs
@@ -302,8 +302,9 @@ where
            || self.previous_key[keep_len] < key[keep_len];
        assert!(
            increasing_keys,
-            "Keys should be increasing. ({:?} > {key:?})",
-            self.previous_key
+            "Keys should be increasing. ({:?} > {:?})",
+            String::from_utf8_lossy(&self.previous_key),
+            String::from_utf8_lossy(key),
        );
        self.previous_key.resize(key.len(), 0u8);
        self.previous_key[keep_len..].copy_from_slice(&key[keep_len..]);
Author	SHA1	Message	Date
Pascal Seitz	3916bef3b9	update CHANGELOG	2026-03-18 10:41:00 +01:00
Paul Masurel	68a9066d13	Fix format (#2852 ) Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>	2026-03-16 10:43:39 +01:00
Paul Masurel	d02559a4d1	Update time deps to defensively address a vulnerability. (#2850 ) Closes #2849 Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>	2026-03-12 16:47:11 +01:00
Anas Limem	1922abaf33	Fixed integer overflow in segment sorting and merge policy truncation (#2846 )	2026-03-12 16:44:38 +01:00
trinity-1686a	d0c5ffb0aa	Merge pull request #2842 from quickwit-oss/congxie/replaceHll Use sketches-ddsketch fork with Java-compatible binary encoding	2026-02-20 16:56:56 +01:00
cong.xie	18fedd9384	Fix nightly fmt: merge crate imports in percentiles tests Co-authored-by: Cursor <cursoragent@cursor.com>	2026-02-19 14:18:54 -05:00
cong.xie	2098fca47f	Restore use_serde feature and simplify PercentilesCollector Keep use_serde on sketches-ddsketch so DDSketch derives Serialize/Deserialize, removing the need for custom impls on PercentilesCollector. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-02-19 14:13:17 -05:00
cong.xie	1251b40c93	Drop use_serde feature; use Java binary encoding for PercentilesCollector Replace the derived Serialize/Deserialize on PercentilesCollector with custom impls that use DDSketch's Java-compatible binary encoding (encode_to_java_bytes / decode_from_java_bytes). This removes the need for the use_serde feature on sketches-ddsketch entirely. Also restore original float test values and use assert_nearly_equals! for all float comparisons in percentile tests, since DDSketch quantile estimates can have minor precision differences across platforms. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-02-19 13:32:28 -05:00
cong.xie	09a49b872c	Use assert_nearly_equals! for float comparisons in percentile test Address review feedback: replace assert_eq! with assert_nearly_equals! for float values that go through JSON serialization roundtrips, which can introduce minor precision differences. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-02-19 13:21:48 -05:00
cong.xie	b9ace002ce	Replace vendored sketches-ddsketch with git dependency Move the vendored sketches-ddsketch crate (with Java-compatible binary encoding) to its own repo at quickwit-oss/rust-sketches-ddsketch and reference it via git+rev in Cargo.toml. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-02-19 12:22:19 -05:00
cong.xie	2dc4e9ef78	fix: resolve remaining clippy errors in ddsketch - Replace approximate PI/E constants with non-famous value in test - Fix reversed empty range (2048..0) → (0..2048).rev() in store test Co-authored-by: Cursor <cursoragent@cursor.com>	2026-02-18 15:54:27 -05:00
cong.xie	aeea65f61d	refactor: rewrite encoding.rs with idiomatic Rust - Replace bare constants with FlagType and BinEncodingMode enums - Use const fn for flag byte construction instead of raw bit ops - Replace if-else chain with nested match in decode_from_java_bytes - Use split_first() in read_byte for idiomatic slice consumption - Use split_at in read_f64_le to avoid TryInto on edition 2018 - Use u64::from(next) instead of `next as u64` casts - Extract assert_golden, assert_quantiles_match, bytes_to_hex helpers to reduce duplication across golden byte tests - Fix edition-2018 assert! format string compatibility - Clean up is_valid_flag_byte with let-else and match Co-authored-by: Cursor <cursoragent@cursor.com>	2026-02-18 15:49:12 -05:00
cong.xie	4211d5a1ed	fix: resolve clippy warnings in vendored sketches-ddsketch - manual_range_contains: use !(0.0..=1.0).contains(&q) - identity_op: simplify (0 << 2) \| FLAG_TYPE to just FLAG_TYPE - manual_clamp: use .clamp(0, 8) instead of .max(0).min(8) - manual_repeat_n: use repeat_n() instead of repeat().take() - cast_abs_to_unsigned: use .unsigned_abs() instead of .abs() as usize Co-authored-by: Cursor <cursoragent@cursor.com>	2026-02-18 13:36:06 -05:00
cong.xie	d50c7a1daf	Add Java source links for cross-language alignment comments Reference the exact Java source files in DataDog/sketches-java for Config::new(), Config::key(), Config::value(), Config::from_gamma(), and Store::add_count() so readers can verify the alignment. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-02-18 13:25:12 -05:00
cong.xie	cf760fd5b6	fix: remove internal reference from code comment Co-authored-by: Cursor <cursoragent@cursor.com>	2026-02-18 12:59:25 -05:00
cong.xie	df04c7d8f1	fix: rustfmt nightly formatting for vendored sketches-ddsketch Co-authored-by: Cursor <cursoragent@cursor.com>	2026-02-18 12:53:01 -05:00
cong.xie	68626bf3a1	Vendor sketches-ddsketch with Java-compatible binary encoding Fork sketches-ddsketch as a workspace member to add native Java binary serialization (to_java_bytes/from_java_bytes) for DDSketch. This enables pomsky to return raw DDSketch bytes that event-query can deserialize via DDSketchWithExactSummaryStatistics.decode(). Key changes: - Vendor sketches-ddsketch crate with encoding.rs implementing VarEncoding, flag bytes, and INDEX_DELTAS_AND_COUNTS store format - Align Config::key() to floor-based indexing matching Java's LogarithmicMapping - Add PercentilesCollector::to_sketch_bytes() for pomsky integration - Cross-language golden byte tests verified byte-identical with Java output Co-authored-by: Cursor <cursoragent@cursor.com>	2026-02-18 11:36:21 -05:00
Adrien Guillo	51f340f83d	Merge pull request #2837 from quickwit-oss/congxie/replaceHll Replace hyperloglogplus with Apache DataSketches HLL (lg_k=11)	2026-02-12 17:19:40 -05:00
cong.xie	7eca33143e	Remove Datadog-specific references from comments This is an open-source repo — replace references to Datadog's event query with generic cross-language compatibility descriptions.	2026-02-12 11:44:42 -05:00
cong.xie	698f073f88	fix fmt	2026-02-11 15:52:39 -05:00
cong.xie	cdd24b7ee5	Replace hyperloglogplus with Apache DataSketches HLL (lg_k=11) Switch tantivy's cardinality aggregation from the hyperloglogplus crate (HyperLogLog++ with p=16) to the official Apache DataSketches HLL implementation (datasketches crate v0.2.0 with lg_k=11, Hll4). This enables returning raw HLL sketch bytes from pomsky to Datadog's event query, where they can be properly deserialized and merged using the same DataSketches library (Java). The previous implementation required pomsky to fabricate fake HLL sketches from scalar cardinality estimates, which produced incorrect results when merged. Changes: - Cargo.toml: hyperloglogplus 0.4.1 -> datasketches 0.2.0 - CardinalityCollector: HyperLogLogPlus<u64, BuildSaltedHasher> -> HllSketch - Custom Serde impl using HllSketch binary format (cross-shard compat) - New to_sketch_bytes() for external consumers (pomsky) - Salt preserved via (salt, value) tuple hashing for column type disambiguation - Removed BuildSaltedHasher struct - Added 4 new unit tests (serde roundtrip, merge, binary compat, salt)	2026-02-11 08:49:46 -05:00
PSeitz	57fe659fff	make serializer pub (#2835 ) some changes on the posting list serializer to make it usable in other contexts. Improve errors Signed-off-by: Pascal Seitz <pascal.seitz@gmail.com>	2026-02-11 14:37:42 +01:00
trinity-1686a	5562ce6037	Merge pull request #2818 from Darkheir/fix/query_grammar_regex_between_parentheses	2026-02-11 11:39:58 +01:00
Metin Dumandag	09b6ececa7	Export fields of the PercentileValuesVecEntry (#2833 ) Otherwise, there is no way to access these fields when not using the json serialized form of the aggregation results. This simple data struct is part of the public api, so its fields should be accessible as well.	2026-02-11 11:31:07 +01:00
Moe	8018016e46	feat: add fast field support for Bytes type (#100 ) (#2830 ) ## What Enable range queries and TopN sorting on `Bytes` fast fields, bringing them to parity with `Str` fields. ## Why `BytesColumn` uses the same dictionary encoding as `StrColumn` internally, but range queries and TopN sorting were explicitly disabled for `Bytes`. This prevented use cases like storing lexicographically sortable binary data (e.g., arbitrary-precision decimals) that need efficient range filtering. ## How 1. Enable range queries for Bytes - Changed `is_type_valid_for_fastfield_range_query()` to return `true` for `Type::Bytes` 2. Add BytesColumn handling in scorer - Added a branch in `FastFieldRangeWeight::scorer()` to handle bytes fields using dictionary ordinal lookup (mirrors the existing `StrColumn` logic) 3. Add SortByBytes - New sort key computer for TopN queries on bytes columns ## Tests - `test_bytes_field_ff_range_query` - Tests inclusive/exclusive bounds and unbounded ranges - `test_sort_by_bytes_asc` / `test_sort_by_bytes_desc` - Tests lexicographic ordering in both directions	2026-02-11 11:26:18 +01:00
trinity-1686a	6bf185dc3f	Merge pull request #2829 from quickwit-oss/cong.xie/add-intermediate-accessors	2026-02-10 17:07:24 +01:00
cong.xie	bb141abe22	feat(aggregation): add keys() accessor to IntermediateAggregationResults	2026-02-09 15:38:35 -05:00
cong.xie	f1c29ba972	resolve conflcit	2026-02-06 14:23:11 -05:00
cong.xie	ae0554a6a5	feat(aggregation): add public accessors for intermediate aggregation results Add accessor methods to allow external crates to read intermediate aggregation results without accessing pub(crate) fields: - IntermediateAggregationResults: get(), remove() - IntermediateTermBucketResult: entries(), sum_other_doc_count(), doc_count_error_upper_bound() - IntermediateAverage: stats() - IntermediateStats: count(), sum() - IntermediateKey: Display impl for string conversion	2026-02-06 11:12:20 -05:00
cong.xie	0d7abe5d23	feat(aggregation): add public accessors for intermediate aggregation results Add accessor methods to allow external crates to read intermediate aggregation results without accessing pub(crate) fields: - IntermediateAggregationResults: get(), get_mut(), remove() - IntermediateTermBucketResult: entries(), sum_other_doc_count(), doc_count_error_upper_bound() - IntermediateAverage: stats() - IntermediateStats: count(), sum() - IntermediateKey: Display impl for string conversion	2026-02-06 10:28:59 -05:00
PSeitz	28db952131	Add regex search and merge segments benchmark (#2826 ) * add merge_segments benchmark * add regex search bench	2026-02-02 17:28:02 +01:00
PSeitz	98ebbf922d	faster exclude queries (#2825 ) * faster exclude queries Faster exclude queries with multiple terms. Changes `Exclude` to be able to exclude multiple DocSets, instead of putting the docsets into a union. Use `seek_danger` in `Exclude`. closes #2822 * replace unwrap with match	2026-01-30 17:06:41 +01:00
Paul Masurel	4a89e74597	Fix rfc3339 typos and add Claude Code skills (#2823 ) Closes #2817	2026-01-30 12:00:28 +01:00
Alex Lazar	4d99e51e50	Bump oneshot to 0.1.13 per dependabot (#2821 )	2026-01-30 11:42:01 +01:00
Darkheir	a55e4069e4	feat(query-grammar): Apply PR review suggestions Signed-off-by: Darkheir <raphael.cohen@sekoia.io>	2026-01-28 14:13:55 +01:00
Darkheir	1fd30c62be	fix(query-grammar): Fix regexes between parentheses Signed-off-by: Darkheir <raphael.cohen@sekoia.io>	2026-01-28 10:37:51 +01:00
trinity-1686a	9b619998bd	Merge pull request #2816 from evance-br/fix-closing-paren-elastic-range	2026-01-27 17:00:08 +01:00
Evance Soumaoro	765c448945	uncomment commented code when testing	2026-01-27 13:19:41 +00:00
Evance Soumaoro	943594ebaa	uncomment commented code when testing	2026-01-27 13:08:38 +00:00
Evance Soumaoro	df17daae0d	fix closing parenthesis error on elastic range queries for lenient parser	2026-01-27 13:01:14 +00:00
Paul Masurel	0ae94baef5	Remove temp file (#2815 ) Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>	2026-01-27 09:22:11 +01:00
Paul Masurel	3f448ecf79	Bugfix on intersection. (#2812 ) The intersection algorithm made it possible for .seek(..) with values lower than the current doc id, breaking the DocSet contract. The fix removes the optimization that caused left.seek(..) to be replaced by a simpler left.advance(..). Simply doing so lead to a performance regression. I therefore integrated that idea within SegmentPostings.seek. We now attempt to check the next doc systematically on seek, PROVIDED the block is already loaded. Closes #2811 Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>	2026-01-27 09:21:09 +01:00