fix fmt

Replace hyperloglogplus with Apache DataSketches HLL (lg_k=11)

Switch tantivy's cardinality aggregation from the hyperloglogplus crate (HyperLogLog++ with p=16) to the official Apache DataSketches HLL implementation (datasketches crate v0.2.0 with lg_k=11, Hll4).

This enables returning raw HLL sketch bytes from pomsky to Datadog's event query, where they can be properly deserialized and merged using the same DataSketches library (Java). The previous implementation required pomsky to fabricate fake HLL sketches from scalar cardinality estimates, which produced incorrect results when merged.

Changes:
- Cargo.toml: hyperloglogplus 0.4.1 -> datasketches 0.2.0
- CardinalityCollector: HyperLogLogPlus<u64, BuildSaltedHasher> -> HllSketch
- Custom Serde impl using HllSketch binary format (cross-shard compat)
- New to_sketch_bytes() for external consumers (pomsky)
- Salt preserved via (salt, value) tuple hashing for column type disambiguation
- Removed BuildSaltedHasher struct
- Added 4 new unit tests (serde roundtrip, merge, binary compat, salt)
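
To illustrate the (salt, value) tuple hashing mentioned in the changes, here is a minimal sketch of the idea. It is not the code from this change: `salted_hash` is a hypothetical name, and the standard library's `DefaultHasher` is an assumption standing in for whatever hashing the collector and the sketch actually perform.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical helper (not part of tantivy or datasketches).
/// Hashes the salt and the value together as one tuple, so the same raw
/// value coming from two differently salted column types produces two
/// different hashes.
fn salted_hash<T: Hash>(salt: u8, value: &T) -> u64 {
    // Assumption: DefaultHasher is used here only for illustration. Its
    // algorithm is not guaranteed to stay the same across Rust releases,
    // so it would be a poor choice for sketches exchanged between processes.
    let mut hasher = DefaultHasher::new();
    (salt, value).hash(&mut hasher);
    hasher.finish()
}
```

With this helper, `salted_hash(1, &42u64)` and `salted_hash(2, &42u64)` should differ, while repeated calls with the same salt and value agree, which is the property the salt is meant to preserve.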
2026-02-12 02:50:37 +00:00 · 2026-02-11 15:52:39 -05:00 · 2026-02-11 08:49:46 -05:00 · 2026-02-11 11:39:58 +01:00 · 2026-02-11 11:31:07 +01:00 · 2026-02-11 11:26:18 +01:00
78 changed files with 2051 additions and 399 deletions
--- a/.claude/skills/rationalize-deps/SKILL.md
+++ b/.claude/skills/rationalize-deps/SKILL.md
@@ -0,0 +1,125 @@
+---
+name: rationalize-deps
+description: Analyze Cargo.toml dependencies and attempt to remove unused features to reduce compile times and binary size
+---
+
+# Rationalize Dependencies
+
+This skill analyzes Cargo.toml dependencies to identify and remove unused features.
+
+## Overview
+
+Many crates enable features by default that may not be needed. This skill:
+1. Identifies dependencies with default features enabled
+2. Tests if `default-features = false` works
+3. Identifies which specific features are actually needed
+4. Verifies compilation after changes
+
+## Step 1: Identify the target
+
+Ask the user which crate(s) to analyze:
+- A specific crate name (e.g., "tokio", "serde")
+- A specific workspace member (e.g., "quickwit-search")
+- "all" to scan the entire workspace
+
+## Step 2: Analyze current dependencies
+
+For the workspace Cargo.toml (`quickwit/Cargo.toml`), list dependencies that:
+- Do NOT have `default-features = false`
+- Have default features that might be unnecessary
+
+Run: `cargo tree -p <crate> -f "{p} {f}" --edges features` to see what features are actually used.
+
+## Step 3: For each candidate dependency
+
+### 3a: Check the crate's default features
+
+Look up the crate on crates.io or check its Cargo.toml to understand:
+- What features are enabled by default
+- What each feature provides
+
+Use: `cargo metadata --format-version=1 | jq '.packages[] | select(.name == "<crate>") | .features'`
+
+### 3b: Try disabling default features
+
+Modify the dependency in `quickwit/Cargo.toml`:
+
+From:
+```toml
+some-crate = { version = "1.0" }
+```
+
+To:
+```toml
+some-crate = { version = "1.0", default-features = false }
+```
+
+### 3c: Run cargo check
+
+Run: `cargo check --workspace` (or target specific packages for faster feedback)
+
+If compilation fails:
+1. Read the error messages to identify which features are needed
+2. Add only the required features explicitly:
+   ```toml
+   some-crate = { version = "1.0", default-features = false, features = ["needed-feature"] }
+   ```
+3. Re-run cargo check
+
+### 3d: Binary search for minimal features
+
+If there are many default features, use binary search:
+1. Start with no features
+2. If it fails, add half the default features
+3. Continue until you find the minimal set
+
+## Step 4: Document findings
+
+For each dependency analyzed, report:
+- Original configuration
+- New configuration (if changed)
+- Features that were removed
+- Any features that are required
+
+## Step 5: Verify full build
+
+After all changes, run:
+```bash
+cargo check --workspace --all-targets
+cargo test --workspace --no-run
+```
+
+## Common Patterns
+
+### Serde
+Often only needs `derive` (plus `std`):
+```toml
+serde = { version = "1.0", default-features = false, features = ["derive", "std"] }
+```
+
+### Tokio
+Identify which runtime features are actually used:
+```toml
+tokio = { version = "1.0", default-features = false, features = ["rt-multi-thread", "macros", "sync"] }
+```
+
+### Reqwest
+Often doesn't need all TLS backends:
+```toml
+reqwest = { version = "0.11", default-features = false, features = ["rustls-tls", "json"] }
+```
+
+## Rollback
+
+If changes cause issues:
+```bash
+git checkout quickwit/Cargo.toml
+cargo check --workspace
+```
+
+## Tips
+
+- Start with large crates that have many default features (tokio, reqwest, hyper)
+- Use `cargo bloat --crates` to identify large dependencies
+- Check `cargo tree -d` for duplicate dependencies that might indicate feature conflicts
+- Some features are needed only for tests - consider using `[dev-dependencies]` features
--- a/.claude/skills/simple-pr/SKILL.md
+++ b/.claude/skills/simple-pr/SKILL.md
@@ -0,0 +1,60 @@
+---
+name: simple-pr
+description: Create a simple PR from staged changes with an auto-generated commit message
+disable-model-invocation: true
+---
+
+# Simple PR
+
+Follow these steps to create a simple PR from staged changes:
+
+## Step 1: Check workspace state
+
+Run: `git status`
+
+Verify that all changes have been staged (no unstaged changes). If there are unstaged changes, abort and ask the user to stage their changes first with `git add`.
+
+Also verify that we are on the `main` branch. If not, abort and ask the user to switch to main first.
+
+## Step 2: Ensure main is up to date
+
+Run: `git pull origin main`
+
+This ensures we're working from the latest code.
+
+## Step 3: Review staged changes
+
+Run: `git diff --cached`
+
+Review the staged changes to understand what the PR will contain.
+
+## Step 4: Generate commit message
+
+Based on the staged changes, generate a concise commit message (1-2 sentences) that describes the "why" rather than the "what".
+
+Display the proposed commit message to the user and ask for confirmation before proceeding.
+
+## Step 5: Create a new branch
+
+Get the git username: `git config user.name | tr ' ' '-' | tr '[:upper:]' '[:lower:]'`
+
+Create a short, descriptive branch name based on the changes (e.g., `fix-typo-in-readme`, `add-retry-logic`, `update-deps`).
+
+Create and checkout the branch: `git checkout -b {username}/{short-descriptive-name}`
+
+## Step 6: Commit changes
+
+Commit with the message from step 4:
+```
+git commit -m "{commit-message}"
+```
+
+## Step 7: Push and open a PR
+
+Push the branch and open a PR:
+```
+git push -u origin {branch-name}
+gh pr create --title "{commit-message-title}" --body "{longer-description-if-needed}"
+```
+
+Report the PR URL to the user when complete.
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -15,7 +15,7 @@ rust-version = "1.85"
 exclude = ["benches/*.json", "benches/*.txt"]

 [dependencies]
-oneshot = "0.1.7"
+oneshot = "0.1.13"
 base64 = "0.22.0"
 byteorder = "1.4.3"
 crc32fast = "1.3.2"
@@ -27,7 +27,7 @@ regex = { version = "1.5.5", default-features = false, features = [
 aho-corasick = "1.0"
 tantivy-fst = "0.5"
 memmap2 = { version = "0.9.0", optional = true }
-lz4_flex = { version = "0.11", default-features = false, optional = true }
+lz4_flex = { version = "0.12", default-features = false, optional = true }
 zstd = { version = "0.13", optional = true, default-features = false }
 tempfile = { version = "3.12.0", optional = true }
 log = "0.4.16"
@@ -50,7 +50,7 @@ fail = { version = "0.5.0", optional = true }
 time = { version = "0.3.35", features = ["serde-well-known"] }
 smallvec = "1.8.0"
 rayon = "1.5.2"
-lru = "0.12.0"
+lru = "0.16.3"
 fastdivide = "0.4.0"
 itertools = "0.14.0"
 measure_time = "0.9.0"
@@ -65,7 +65,7 @@ tantivy-bitpacker = { version = "0.9", path = "./bitpacker" }
 common = { version = "0.10", path = "./common/", package = "tantivy-common" }
 tokenizer-api = { version = "0.6", path = "./tokenizer-api", package = "tantivy-tokenizer-api" }
 sketches-ddsketch = { version = "0.3.0", features = ["use_serde"] }
-hyperloglogplus = { version = "0.4.1", features = ["const-loop"] }
+datasketches = "0.2.0"
 futures-util = { version = "0.3.28", optional = true }
 futures-channel = { version = "0.3.28", optional = true }
 fnv = "1.0.7"
@@ -76,7 +76,7 @@ winapi = "0.3.9"

 [dev-dependencies]
 binggan = "0.14.2"
-rand = "0.8.5"
+rand = "0.9"
 maplit = "1.0.2"
 matches = "0.1.9"
 pretty_assertions = "1.2.1"
@@ -85,7 +85,7 @@ test-log = "0.2.10"
 futures = "0.3.21"
 paste = "1.0.11"
 more-asserts = "0.3.1"
-rand_distr = "0.4.3"
+rand_distr = "0.5"
 time = { version = "0.3.10", features = ["serde-well-known", "macros"] }
 postcard = { version = "1.0.4", features = [
    "use-std",
@@ -189,3 +189,16 @@ harness = false
 [[bench]]
 name = "bool_queries_with_range"
 harness = false
+
+[[bench]]
+name = "str_search_and_get"
+harness = false
+
+[[bench]]
+name = "merge_segments"
+harness = false
+
+[[bench]]
+name = "regex_all_terms"
+harness = false
+
--- a/benches/agg_bench.rs
+++ b/benches/agg_bench.rs
@@ -1,8 +1,8 @@
 use binggan::plugins::PeakMemAllocPlugin;
 use binggan::{black_box, InputGroup, PeakMemAlloc, INSTRUMENTED_SYSTEM};
-use rand::distributions::WeightedIndex;
-use rand::prelude::SliceRandom;
+use rand::distr::weighted::WeightedIndex;
 use rand::rngs::StdRng;
+use rand::seq::IndexedRandom;
 use rand::{Rng, SeedableRng};
 use rand_distr::Distribution;
 use serde_json::json;
@@ -532,7 +532,7 @@ fn get_test_index_bench(cardinality: Cardinality) -> tantivy::Result<Index> {
    // Prepare 1000 unique terms sampled using a Zipf distribution.
    // Exponent ~1.1 approximates top-20 terms covering around ~20%.
    let terms_1000: Vec<String> = (1..=1000).map(|i| format!("term_{i}")).collect();
-    let zipf_1000 = rand_distr::Zipf::new(1000, 1.1f64).unwrap();
+    let zipf_1000 = rand_distr::Zipf::new(1000.0, 1.1f64).unwrap();

    {
        let mut rng = StdRng::from_seed([1u8; 32]);
@@ -576,8 +576,8 @@ fn get_test_index_bench(cardinality: Cardinality) -> tantivy::Result<Index> {
        }
        let _val_max = 1_000_000.0;
        for _ in 0..doc_with_value {
-            let val: f64 = rng.gen_range(0.0..1_000_000.0);
-            let json = if rng.gen_bool(0.1) {
+            let val: f64 = rng.random_range(0.0..1_000_000.0);
+            let json = if rng.random_bool(0.1) {
                // 10% are numeric values
                json!({ "mixed_type": val })
            } else {
@@ -586,7 +586,7 @@ fn get_test_index_bench(cardinality: Cardinality) -> tantivy::Result<Index> {
            index_writer.add_document(doc!(
                text_field => "cool",
                json_field => json,
-                text_field_all_unique_terms => format!("unique_term_{}", rng.gen::<u64>()),
+                text_field_all_unique_terms => format!("unique_term_{}", rng.random::<u64>()),
                text_field_many_terms => many_terms_data.choose(&mut rng).unwrap().to_string(),
                text_field_few_terms_status => status_field_data[log_level_distribution.sample(&mut rng)].0,
                text_field_1000_terms_zipf => terms_1000[zipf_1000.sample(&mut rng) as usize - 1].as_str(),
--- a/benches/and_or_queries.rs
+++ b/benches/and_or_queries.rs
@@ -55,29 +55,29 @@ fn build_shared_indices(num_docs: usize, p_a: f32, p_b: f32, p_c: f32) -> (Bench
    {
        let mut writer = index.writer_with_num_threads(1, 500_000_000).unwrap();
        for _ in 0..num_docs {
-            let has_a = rng.gen_bool(p_a as f64);
-            let has_b = rng.gen_bool(p_b as f64);
-            let has_c = rng.gen_bool(p_c as f64);
-            let score = rng.gen_range(0u64..100u64);
-            let score2 = rng.gen_range(0u64..100_000u64);
+            let has_a = rng.random_bool(p_a as f64);
+            let has_b = rng.random_bool(p_b as f64);
+            let has_c = rng.random_bool(p_c as f64);
+            let score = rng.random_range(0u64..100u64);
+            let score2 = rng.random_range(0u64..100_000u64);
            let mut title_tokens: Vec<&str> = Vec::new();
            let mut body_tokens: Vec<&str> = Vec::new();
            if has_a {
-                if rng.gen_bool(0.1) {
+                if rng.random_bool(0.1) {
                    title_tokens.push("a");
                } else {
                    body_tokens.push("a");
                }
            }
            if has_b {
-                if rng.gen_bool(0.1) {
+                if rng.random_bool(0.1) {
                    title_tokens.push("b");
                } else {
                    body_tokens.push("b");
                }
            }
            if has_c {
-                if rng.gen_bool(0.1) {
+                if rng.random_bool(0.1) {
                    title_tokens.push("c");
                } else {
                    body_tokens.push("c");
--- a/benches/bool_queries_with_range.rs
+++ b/benches/bool_queries_with_range.rs
@@ -36,13 +36,13 @@ fn build_shared_indices(num_docs: usize, p_title_a: f32, distribution: &str) ->
            "dense" => {
                for doc_id in 0..num_docs {
                    // Always add title to avoid empty documents
-                    let title_token = if rng.gen_bool(p_title_a as f64) {
+                    let title_token = if rng.random_bool(p_title_a as f64) {
                        "a"
                    } else {
                        "b"
                    };

-                    let num_rand = rng.gen_range(0u64..1000u64);
+                    let num_rand = rng.random_range(0u64..1000u64);

                    let num_asc = (doc_id / 10000) as u64;

@@ -60,13 +60,13 @@ fn build_shared_indices(num_docs: usize, p_title_a: f32, distribution: &str) ->
            "sparse" => {
                for doc_id in 0..num_docs {
                    // Always add title to avoid empty documents
-                    let title_token = if rng.gen_bool(p_title_a as f64) {
+                    let title_token = if rng.random_bool(p_title_a as f64) {
                        "a"
                    } else {
                        "b"
                    };

-                    let num_rand = rng.gen_range(0u64..10000000u64);
+                    let num_rand = rng.random_range(0u64..10000000u64);

                    let num_asc = doc_id as u64;

--- a/benches/merge_segments.rs
+++ b/benches/merge_segments.rs
@@ -0,0 +1,224 @@
+// Benchmarks segment merging
+//
+// Notes:
+// - Input segments are kept intact (no deletes / no IndexWriter merge).
+// - Output is written to a `NullDirectory` that discards all files except
+//  fieldnorms (needed for merging).
+
+use std::collections::HashMap;
+use std::io::{self, Write};
+use std::path::{Path, PathBuf};
+use std::sync::{Arc, RwLock};
+
+use binggan::{black_box, BenchRunner};
+use rand::prelude::*;
+use rand::rngs::StdRng;
+use rand::SeedableRng;
+use tantivy::directory::error::{DeleteError, OpenReadError, OpenWriteError};
+use tantivy::directory::{
+    AntiCallToken, Directory, FileHandle, OwnedBytes, TerminatingWrite, WatchCallback, WatchHandle,
+    WritePtr,
+};
+use tantivy::indexer::{merge_filtered_segments, NoMergePolicy};
+use tantivy::schema::{Schema, TEXT};
+use tantivy::{doc, HasLen, Index, IndexSettings, Segment};
+
+#[derive(Clone, Default, Debug)]
+struct NullDirectory {
+    blobs: Arc<RwLock<HashMap<PathBuf, OwnedBytes>>>,
+}
+
+struct NullWriter;
+
+impl Write for NullWriter {
+    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
+        Ok(buf.len())
+    }
+
+    fn flush(&mut self) -> io::Result<()> {
+        Ok(())
+    }
+}
+
+impl TerminatingWrite for NullWriter {
+    fn terminate_ref(&mut self, _token: AntiCallToken) -> io::Result<()> {
+        Ok(())
+    }
+}
+
+struct InMemoryWriter {
+    path: PathBuf,
+    buffer: Vec<u8>,
+    blobs: Arc<RwLock<HashMap<PathBuf, OwnedBytes>>>,
+}
+
+impl Write for InMemoryWriter {
+    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
+        self.buffer.extend_from_slice(buf);
+        Ok(buf.len())
+    }
+
+    fn flush(&mut self) -> io::Result<()> {
+        Ok(())
+    }
+}
+
+impl TerminatingWrite for InMemoryWriter {
+    fn terminate_ref(&mut self, _token: AntiCallToken) -> io::Result<()> {
+        let bytes = OwnedBytes::new(std::mem::take(&mut self.buffer));
+        self.blobs.write().unwrap().insert(self.path.clone(), bytes);
+        Ok(())
+    }
+}
+
+#[derive(Debug, Default)]
+struct NullFileHandle;
+impl HasLen for NullFileHandle {
+    fn len(&self) -> usize {
+        0
+    }
+}
+impl FileHandle for NullFileHandle {
+    fn read_bytes(&self, _range: std::ops::Range<usize>) -> io::Result<OwnedBytes> {
+        unimplemented!()
+    }
+}
+
+impl Directory for NullDirectory {
+    fn get_file_handle(&self, path: &Path) -> Result<Arc<dyn FileHandle>, OpenReadError> {
+        if let Some(bytes) = self.blobs.read().unwrap().get(path) {
+            return Ok(Arc::new(bytes.clone()));
+        }
+        Ok(Arc::new(NullFileHandle))
+    }
+
+    fn delete(&self, _path: &Path) -> Result<(), DeleteError> {
+        Ok(())
+    }
+
+    fn exists(&self, _path: &Path) -> Result<bool, OpenReadError> {
+        Ok(true)
+    }
+
+    fn open_write(&self, path: &Path) -> Result<WritePtr, OpenWriteError> {
+        let path_buf = path.to_path_buf();
+        if path.to_string_lossy().ends_with(".fieldnorm") {
+            let writer = InMemoryWriter {
+                path: path_buf,
+                buffer: Vec::new(),
+                blobs: Arc::clone(&self.blobs),
+            };
+            Ok(io::BufWriter::new(Box::new(writer)))
+        } else {
+            Ok(io::BufWriter::new(Box::new(NullWriter)))
+        }
+    }
+
+    fn atomic_read(&self, path: &Path) -> Result<Vec<u8>, OpenReadError> {
+        if let Some(bytes) = self.blobs.read().unwrap().get(path) {
+            return Ok(bytes.as_slice().to_vec());
+        }
+        Err(OpenReadError::FileDoesNotExist(path.to_path_buf()))
+    }
+
+    fn atomic_write(&self, _path: &Path, _data: &[u8]) -> io::Result<()> {
+        Ok(())
+    }
+
+    fn sync_directory(&self) -> io::Result<()> {
+        Ok(())
+    }
+
+    fn watch(&self, _watch_callback: WatchCallback) -> tantivy::Result<WatchHandle> {
+        Ok(WatchHandle::empty())
+    }
+}
+
+struct MergeScenario {
+    #[allow(dead_code)]
+    index: Index,
+    segments: Vec<Segment>,
+    settings: IndexSettings,
+    label: String,
+}
+
+fn build_index(
+    num_segments: usize,
+    docs_per_segment: usize,
+    tokens_per_doc: usize,
+    vocab_size: usize,
+) -> MergeScenario {
+    let mut schema_builder = Schema::builder();
+    let body = schema_builder.add_text_field("body", TEXT);
+    let schema = schema_builder.build();
+    let index = Index::create_in_ram(schema.clone());
+
+    assert!(vocab_size > 0);
+    let total_tokens = num_segments * docs_per_segment * tokens_per_doc;
+    let use_unique_terms = vocab_size >= total_tokens;
+    let mut rng = StdRng::from_seed([7u8; 32]);
+    let mut next_token_id: u64 = 0;
+
+    {
+        let mut writer = index.writer_with_num_threads(1, 256_000_000).unwrap();
+        writer.set_merge_policy(Box::new(NoMergePolicy));
+        for _ in 0..num_segments {
+            for _ in 0..docs_per_segment {
+                let mut tokens = Vec::with_capacity(tokens_per_doc);
+                for _ in 0..tokens_per_doc {
+                    let token_id = if use_unique_terms {
+                        let id = next_token_id;
+                        next_token_id += 1;
+                        id
+                    } else {
+                        rng.random_range(0..vocab_size as u64)
+                    };
+                    tokens.push(format!("term_{token_id}"));
+                }
+                writer.add_document(doc!(body => tokens.join(" "))).unwrap();
+            }
+            writer.commit().unwrap();
+        }
+    }
+
+    let segments = index.searchable_segments().unwrap();
+    let settings = index.settings().clone();
+    let label = format!(
+        "segments={}, docs/seg={}, tokens/doc={}, vocab={}",
+        num_segments, docs_per_segment, tokens_per_doc, vocab_size
+    );
+
+    MergeScenario {
+        index,
+        segments,
+        settings,
+        label,
+    }
+}
+
+fn main() {
+    let scenarios = vec![
+        build_index(8, 50_000, 12, 8),
+        build_index(16, 50_000, 12, 8),
+        build_index(16, 100_000, 12, 8),
+        build_index(8, 50_000, 8, 8 * 50_000 * 8),
+    ];
+
+    let mut runner = BenchRunner::new();
+    for scenario in scenarios {
+        let mut group = runner.new_group();
+        group.set_name(format!("merge_segments inv_index — {}", scenario.label));
+        let segments = scenario.segments.clone();
+        let settings = scenario.settings.clone();
+        group.register("merge", move |_| {
+            let output_dir = NullDirectory::default();
+            let filter_doc_ids = vec![None; segments.len()];
+            let merged_index =
+                merge_filtered_segments(&segments, settings.clone(), filter_doc_ids, output_dir)
+                    .unwrap();
+            black_box(merged_index);
+        });
+
+        group.run();
+    }
+}
--- a/benches/range_queries.rs
+++ b/benches/range_queries.rs
@@ -33,7 +33,7 @@ fn build_shared_indices(num_docs: usize, distribution: &str) -> BenchIndex {
        match distribution {
            "dense" => {
                for doc_id in 0..num_docs {
-                    let num_rand = rng.gen_range(0u64..1000u64);
+                    let num_rand = rng.random_range(0u64..1000u64);
                    let num_asc = (doc_id / 10000) as u64;

                    writer
@@ -46,7 +46,7 @@ fn build_shared_indices(num_docs: usize, distribution: &str) -> BenchIndex {
            }
            "sparse" => {
                for doc_id in 0..num_docs {
-                    let num_rand = rng.gen_range(0u64..10000000u64);
+                    let num_rand = rng.random_range(0u64..10000000u64);
                    let num_asc = doc_id as u64;

                    writer
--- a/benches/range_query.rs
+++ b/benches/range_query.rs
@@ -97,20 +97,20 @@ fn get_index_0_to_100() -> Index {
    let num_vals = 100_000;
    let docs: Vec<_> = (0..num_vals)
        .map(|_i| {
-            let id_name = if rng.gen_bool(0.01) {
+            let id_name = if rng.random_bool(0.01) {
                "veryfew".to_string() // 1%
-            } else if rng.gen_bool(0.1) {
+            } else if rng.random_bool(0.1) {
                "few".to_string() // 9%
            } else {
                "most".to_string() // 90%
            };
            Doc {
                id_name,
-                id: rng.gen_range(0..100),
+                id: rng.random_range(0..100),
                // Multiply by 1000, so that we create most buckets in the compact space
                // The benches depend on this range to select n-percent of elements with the
                // methods below.
-                ip: Ipv6Addr::from_u128(rng.gen_range(0..100) * 1000),
+                ip: Ipv6Addr::from_u128(rng.random_range(0..100) * 1000),
            }
        })
        .collect();
--- a/benches/regex_all_terms.rs
+++ b/benches/regex_all_terms.rs
@@ -0,0 +1,113 @@
+// Benchmarks regex query that matches all terms in a synthetic index.
+//
+// Corpus model:
+// - N unique terms: t000000, t000001, ...
+// - M docs
+// - K tokens per doc: doc i gets terms derived from (i, token_index)
+//
+// Query:
+// - Regex "t.*" to match all terms
+//
+// Run with:
+// - cargo bench --bench regex_all_terms
+//
+
+use std::fmt::Write;
+
+use binggan::{black_box, BenchRunner};
+use tantivy::collector::Count;
+use tantivy::query::RegexQuery;
+use tantivy::schema::{Schema, TEXT};
+use tantivy::{doc, Index, ReloadPolicy};
+
+const HEAP_SIZE_BYTES: usize = 200_000_000;
+
+#[derive(Clone, Copy)]
+struct BenchConfig {
+    num_terms: usize,
+    num_docs: usize,
+    tokens_per_doc: usize,
+}
+
+fn main() {
+    let configs = default_configs();
+
+    let mut runner = BenchRunner::new();
+    for config in configs {
+        let (index, text_field) = build_index(config, HEAP_SIZE_BYTES);
+        let reader = index
+            .reader_builder()
+            .reload_policy(ReloadPolicy::Manual)
+            .try_into()
+            .expect("reader");
+        let searcher = reader.searcher();
+        let query = RegexQuery::from_pattern("t.*", text_field).expect("regex query");
+
+        let mut group = runner.new_group();
+        group.set_name(format!(
+            "regex_all_terms_t{}_d{}_k{}",
+            config.num_terms, config.num_docs, config.tokens_per_doc
+        ));
+        group.register("regex_count", move |_| {
+            let count = searcher.search(&query, &Count).expect("search");
+            black_box(count);
+        });
+        group.run();
+    }
+}
+
+fn default_configs() -> Vec<BenchConfig> {
+    vec![
+        BenchConfig {
+            num_terms: 10_000,
+            num_docs: 100_000,
+            tokens_per_doc: 1,
+        },
+        BenchConfig {
+            num_terms: 10_000,
+            num_docs: 100_000,
+            tokens_per_doc: 8,
+        },
+        BenchConfig {
+            num_terms: 100_000,
+            num_docs: 100_000,
+            tokens_per_doc: 1,
+        },
+        BenchConfig {
+            num_terms: 100_000,
+            num_docs: 100_000,
+            tokens_per_doc: 8,
+        },
+    ]
+}
+
+fn build_index(config: BenchConfig, heap_size_bytes: usize) -> (Index, tantivy::schema::Field) {
+    let mut schema_builder = Schema::builder();
+    let text_field = schema_builder.add_text_field("text", TEXT);
+    let schema = schema_builder.build();
+    let index = Index::create_in_ram(schema);
+
+    let term_width = config.num_terms.to_string().len();
+    {
+        let mut writer = index
+            .writer_with_num_threads(1, heap_size_bytes)
+            .expect("writer");
+        let mut buffer = String::new();
+        for doc_id in 0..config.num_docs {
+            buffer.clear();
+            for token_idx in 0..config.tokens_per_doc {
+                if token_idx > 0 {
+                    buffer.push(' ');
+                }
+                let term_id = (doc_id * config.tokens_per_doc + token_idx) % config.num_terms;
+                write!(&mut buffer, "t{term_id:0term_width$}").expect("write token");
+            }
+            writer
+                .add_document(doc!(text_field => buffer.as_str()))
+                .expect("add_document");
+        }
+        writer.commit().expect("commit");
+    }
+
+    (index, text_field)
+}
--- a/benches/str_search_and_get.rs
+++ b/benches/str_search_and_get.rs
@@ -0,0 +1,421 @@
+// This benchmark compares different approaches for retrieving string values:
+//
+// 1. Fast Field Approach: retrieves string values via term_ords() and ord_to_str()
+//
+// 2. Doc Store Approach: retrieves string values via searcher.doc() and field extraction
+//
+// The benchmark includes various data distributions:
+// - Dense Sequential: Sequential document IDs with dense data
+// - Dense Random: Random document IDs with dense data
+// - Sparse Sequential: Sequential document IDs with sparse data
+// - Sparse Random: Random document IDs with sparse data
+use std::ops::Bound;
+
+use binggan::{black_box, BenchGroup, BenchRunner};
+use rand::prelude::*;
+use rand::rngs::StdRng;
+use rand::SeedableRng;
+use tantivy::collector::{Count, DocSetCollector};
+use tantivy::query::RangeQuery;
+use tantivy::schema::document::TantivyDocument;
+use tantivy::schema::{Schema, Value, FAST, STORED, STRING};
+use tantivy::{doc, Index, ReloadPolicy, Searcher, Term};
+
+#[derive(Clone)]
+struct BenchIndex {
+    #[allow(dead_code)]
+    index: Index,
+    searcher: Searcher,
+}
+
+fn build_shared_indices(num_docs: usize, distribution: &str) -> BenchIndex {
+    // Schema with string fast field and stored field for doc access
+    let mut schema_builder = Schema::builder();
+    let f_str_fast = schema_builder.add_text_field("str_fast", STRING | STORED | FAST);
+    let f_str_stored = schema_builder.add_text_field("str_stored", STRING | STORED);
+    let schema = schema_builder.build();
+    let index = Index::create_in_ram(schema.clone());
+
+    // Populate index with stable RNG for reproducibility.
+    let mut rng = StdRng::from_seed([7u8; 32]);
+
+    {
+        let mut writer = index.writer_with_num_threads(1, 4_000_000_000).unwrap();
+
+        match distribution {
+            "dense_random" => {
+                for _doc_id in 0..num_docs {
+                    let suffix = rng.random_range(0u64..1000u64);
+                    let str_val = format!("str_{:03}", suffix);
+
+                    writer
+                        .add_document(doc!(
+                            f_str_fast=>str_val.clone(),
+                            f_str_stored=>str_val,
+                        ))
+                        .unwrap();
+                }
+            }
+            "dense_sequential" => {
+                for doc_id in 0..num_docs {
+                    let suffix = doc_id as u64 % 1000;
+                    let str_val = format!("str_{:03}", suffix);
+
+                    writer
+                        .add_document(doc!(
+                            f_str_fast=>str_val.clone(),
+                            f_str_stored=>str_val,
+                        ))
+                        .unwrap();
+                }
+            }
+            "sparse_random" => {
+                for _doc_id in 0..num_docs {
+                    let suffix = rng.random_range(0u64..1000000u64);
+                    let str_val = format!("str_{:07}", suffix);
+
+                    writer
+                        .add_document(doc!(
+                            f_str_fast=>str_val.clone(),
+                            f_str_stored=>str_val,
+                        ))
+                        .unwrap();
+                }
+            }
+            "sparse_sequential" => {
+                for doc_id in 0..num_docs {
+                    let suffix = doc_id as u64;
+                    let str_val = format!("str_{:07}", suffix);
+
+                    writer
+                        .add_document(doc!(
+                            f_str_fast=>str_val.clone(),
+                            f_str_stored=>str_val,
+                        ))
+                        .unwrap();
+                }
+            }
+            _ => {
+                panic!("Unsupported distribution type");
+            }
+        }
+        writer.commit().unwrap();
+    }
+
+    // Prepare reader/searcher once.
+    let reader = index
+        .reader_builder()
+        .reload_policy(ReloadPolicy::Manual)
+        .try_into()
+        .unwrap();
+    let searcher = reader.searcher();
+
+    BenchIndex { index, searcher }
+}
+
+fn main() {
+    // Prepare corpora with varying scenarios
+    let scenarios = vec![
+        (
+            "dense_random_search_low_range".to_string(),
+            1_000_000,
+            "dense_random",
+            0,
+            9,
+        ),
+        (
+            "dense_random_search_high_range".to_string(),
+            1_000_000,
+            "dense_random",
+            990,
+            999,
+        ),
+        (
+            "dense_sequential_search_low_range".to_string(),
+            1_000_000,
+            "dense_sequential",
+            0,
+            9,
+        ),
+        (
+            "dense_sequential_search_high_range".to_string(),
+            1_000_000,
+            "dense_sequential",
+            990,
+            999,
+        ),
+        (
+            "sparse_random_search_low_range".to_string(),
+            1_000_000,
+            "sparse_random",
+            0,
+            9999,
+        ),
+        (
+            "sparse_random_search_high_range".to_string(),
+            1_000_000,
+            "sparse_random",
+            990_000,
+            999_999,
+        ),
+        (
+            "sparse_sequential_search_low_range".to_string(),
+            1_000_000,
+            "sparse_sequential",
+            0,
+            9999,
+        ),
+        (
+            "sparse_sequential_search_high_range".to_string(),
+            1_000_000,
+            "sparse_sequential",
+            990_000,
+            999_999,
+        ),
+    ];
+
+    let mut runner = BenchRunner::new();
+    for (scenario_id, n, distribution, range_low, range_high) in scenarios {
+        let bench_index = build_shared_indices(n, distribution);
+        let mut group = runner.new_group();
+        group.set_name(scenario_id);
+
+        let field = bench_index.searcher.schema().get_field("str_fast").unwrap();
+
+        let (lower_str, upper_str) =
+            if distribution == "dense_sequential" || distribution == "dense_random" {
+                (
+                    format!("str_{:03}", range_low),
+                    format!("str_{:03}", range_high),
+                )
+            } else {
+                (
+                    format!("str_{:07}", range_low),
+                    format!("str_{:07}", range_high),
+                )
+            };
+
+        let lower_term = Term::from_field_text(field, &lower_str);
+        let upper_term = Term::from_field_text(field, &upper_str);
+
+        let query = RangeQuery::new(Bound::Included(lower_term), Bound::Included(upper_term));
+
+        run_benchmark_tasks(&mut group, &bench_index, query, range_low, range_high);
+
+        group.run();
+    }
+}
+
+/// Run all benchmark tasks for a given range query
+fn run_benchmark_tasks(
+    bench_group: &mut BenchGroup,
+    bench_index: &BenchIndex,
+    query: RangeQuery,
+    range_low: u64,
+    range_high: u64,
+) {
+    // Test count of matching documents
+    add_bench_task_count(
+        bench_group,
+        bench_index,
+        query.clone(),
+        range_low,
+        range_high,
+    );
+
+    // Test fetching all DocIds of matching documents
+    add_bench_task_docset(
+        bench_group,
+        bench_index,
+        query.clone(),
+        range_low,
+        range_high,
+    );
+
+    // Test fetching all string fast field values of matching documents
+    add_bench_task_fetch_all_strings(
+        bench_group,
+        bench_index,
+        query.clone(),
+        range_low,
+        range_high,
+    );
+
+    // Test fetching all string values of matching documents through doc() method
+    add_bench_task_fetch_all_strings_from_doc(
+        bench_group,
+        bench_index,
+        query,
+        range_low,
+        range_high,
+    );
+}
+
+fn add_bench_task_count(
+    bench_group: &mut BenchGroup,
+    bench_index: &BenchIndex,
+    query: RangeQuery,
+    range_low: u64,
+    range_high: u64,
+) {
+    let task_name = format!("string_search_count_[{}-{}]", range_low, range_high);
+
+    let search_task = CountSearchTask {
+        searcher: bench_index.searcher.clone(),
+        query,
+    };
+    bench_group.register(task_name, move |_| black_box(search_task.run()));
+}
+
+fn add_bench_task_docset(
+    bench_group: &mut BenchGroup,
+    bench_index: &BenchIndex,
+    query: RangeQuery,
+    range_low: u64,
+    range_high: u64,
+) {
+    let task_name = format!("string_fetch_all_docset_[{}-{}]", range_low, range_high);
+
+    let search_task = DocSetSearchTask {
+        searcher: bench_index.searcher.clone(),
+        query,
+    };
+    bench_group.register(task_name, move |_| black_box(search_task.run()));
+}
+
+fn add_bench_task_fetch_all_strings(
+    bench_group: &mut BenchGroup,
+    bench_index: &BenchIndex,
+    query: RangeQuery,
+    range_low: u64,
+    range_high: u64,
+) {
+    let task_name = format!(
+        "string_fastfield_fetch_all_strings_[{}-{}]",
+        range_low, range_high
+    );
+
+    let search_task = FetchAllStringsSearchTask {
+        searcher: bench_index.searcher.clone(),
+        query,
+    };
+
+    bench_group.register(task_name, move |_| {
+        let result = black_box(search_task.run());
+        result.len()
+    });
+}
+
+fn add_bench_task_fetch_all_strings_from_doc(
+    bench_group: &mut BenchGroup,
+    bench_index: &BenchIndex,
+    query: RangeQuery,
+    range_low: u64,
+    range_high: u64,
+) {
+    let task_name = format!(
+        "string_doc_fetch_all_strings_[{}-{}]",
+        range_low, range_high
+    );
+
+    let search_task = FetchAllStringsFromDocTask {
+        searcher: bench_index.searcher.clone(),
+        query,
+    };
+
+    bench_group.register(task_name, move |_| {
+        let result = black_box(search_task.run());
+        result.len()
+    });
+}
+
+struct CountSearchTask {
+    searcher: Searcher,
+    query: RangeQuery,
+}
+
+impl CountSearchTask {
+    #[inline(never)]
+    pub fn run(&self) -> usize {
+        self.searcher.search(&self.query, &Count).unwrap()
+    }
+}
+
+struct DocSetSearchTask {
+    searcher: Searcher,
+    query: RangeQuery,
+}
+
+impl DocSetSearchTask {
+    #[inline(never)]
+    pub fn run(&self) -> usize {
+        let result = self.searcher.search(&self.query, &DocSetCollector).unwrap();
+        result.len()
+    }
+}
+
+struct FetchAllStringsSearchTask {
+    searcher: Searcher,
+    query: RangeQuery,
+}
+
+impl FetchAllStringsSearchTask {
+    #[inline(never)]
+    pub fn run(&self) -> Vec<String> {
+        let doc_addresses = self.searcher.search(&self.query, &DocSetCollector).unwrap();
+        let mut docs = doc_addresses.into_iter().collect::<Vec<_>>();
+        docs.sort();
+        let mut strings = Vec::with_capacity(docs.len());
+
+        for doc_address in docs {
+            let segment_reader = &self.searcher.segment_readers()[doc_address.segment_ord as usize];
+            let str_column_opt = segment_reader.fast_fields().str("str_fast");
+
+            if let Ok(Some(str_column)) = str_column_opt {
+                let doc_id = doc_address.doc_id;
+                let term_ord = str_column.term_ords(doc_id).next().unwrap();
+                let mut str_buffer = String::new();
+                if str_column.ord_to_str(term_ord, &mut str_buffer).is_ok() {
+                    strings.push(str_buffer);
+                }
+            }
+        }
+
+        strings
+    }
+}
+
+struct FetchAllStringsFromDocTask {
+    searcher: Searcher,
+    query: RangeQuery,
+}
+
+impl FetchAllStringsFromDocTask {
+    #[inline(never)]
+    pub fn run(&self) -> Vec<String> {
+        let doc_addresses = self.searcher.search(&self.query, &DocSetCollector).unwrap();
+        let mut docs = doc_addresses.into_iter().collect::<Vec<_>>();
+        docs.sort();
+        let mut strings = Vec::with_capacity(docs.len());
+
+        let str_stored_field = self
+            .searcher
+            .schema()
+            .get_field("str_stored")
+            .expect("str_stored field should exist");
+
+        for doc_address in docs {
+            // Get the document from the doc store (row store access)
+            if let Ok(doc) = self.searcher.doc::<TantivyDocument>(doc_address) {
+                // Extract string values from the stored field
+                if let Some(field_value) = doc.get_first(str_stored_field) {
+                    if let Some(text) = field_value.as_value().as_str() {
+                        strings.push(text.to_string());
+                    }
+                }
+            }
+        }
+
+        strings
+    }
+}
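Review note: the range bounds in this benchmark are zero-padded (`str_{:03}` for the dense cases, `str_{:07}` for the sparse ones). The reason is presumably that a string `RangeQuery` compares terms lexicographically, so padding keeps term order aligned with numeric order. The sketch below uses only the standard library to show the effect; `sorted_keys` is a hypothetical helper, not part of the patch.

```rust
/// Builds `str_<n>` keys zero-padded to `width` digits and returns them
/// sorted as plain strings (lexicographic order).
/// Hypothetical helper for illustration; not part of the benchmark.
fn sorted_keys(values: &[u64], width: usize) -> Vec<String> {
    let mut keys: Vec<String> = values
        .iter()
        .map(|n| format!("str_{n:0width$}"))
        .collect();
    keys.sort();
    keys
}

fn main() {
    // Unpadded: "str_10" sorts before "str_2", so string order differs
    // from numeric order.
    assert_eq!(sorted_keys(&[2, 10, 1], 0), ["str_1", "str_10", "str_2"]);

    // Padded to 3 digits: string order matches numeric order.
    assert_eq!(sorted_keys(&[2, 10, 1], 3), ["str_001", "str_002", "str_010"]);
}
```

With padded terms, an inclusive range such as `str_000` to `str_009` selects exactly the ten lowest keys.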
--- a/bitpacker/Cargo.toml
+++ b/bitpacker/Cargo.toml
@@ -18,5 +18,5 @@ homepage = "https://github.com/quickwit-oss/tantivy"
 bitpacking = { version = "0.9.2", default-features = false, features = ["bitpacker1x"] }

 [dev-dependencies]
-rand = "0.8"
+rand = "0.9"
 proptest = "1"
--- a/bitpacker/benches/bench.rs
+++ b/bitpacker/benches/bench.rs
@@ -4,8 +4,8 @@ extern crate test;

 #[cfg(test)]
 mod tests {
+    use rand::rng;
    use rand::seq::IteratorRandom;
-    use rand::thread_rng;
    use tantivy_bitpacker::{BitPacker, BitUnpacker, BlockedBitpacker};
    use test::Bencher;

@@ -27,7 +27,7 @@ mod tests {
        let num_els = 1_000_000u32;
        let bit_unpacker = BitUnpacker::new(bit_width);
        let data = create_bitpacked_data(bit_width, num_els);
-        let idxs: Vec<u32> = (0..num_els).choose_multiple(&mut thread_rng(), 100_000);
+        let idxs: Vec<u32> = (0..num_els).choose_multiple(&mut rng(), 100_000);
        b.iter(|| {
            let mut out = 0u64;
            for &idx in &idxs {
--- a/columnar/Cargo.toml
+++ b/columnar/Cargo.toml
@@ -22,7 +22,7 @@ downcast-rs = "2.0.1"
 [dev-dependencies]
 proptest = "1"
 more-asserts = "0.3.1"
-rand = "0.8"
+rand = "0.9"
 binggan = "0.14.0"

 [[bench]]
--- a/columnar/benches/bench_column_values_get.rs
+++ b/columnar/benches/bench_column_values_get.rs
@@ -9,7 +9,7 @@ use tantivy_columnar::column_values::{CodecType, serialize_and_load_u64_based_co
 fn get_data() -> Vec<u64> {
    let mut rng = StdRng::seed_from_u64(2u64);
    let mut data: Vec<_> = (100..55_000_u64)
-        .map(|num| num + rng.r#gen::<u8>() as u64)
+        .map(|num| num + rng.random::<u8>() as u64)
        .collect();
    data.push(99_000);
    data.insert(1000, 2000);
--- a/columnar/benches/bench_create_column_values.rs
+++ b/columnar/benches/bench_create_column_values.rs
@@ -6,7 +6,7 @@ use tantivy_columnar::column_values::{CodecType, serialize_u64_based_column_valu
 fn get_data() -> Vec<u64> {
    let mut rng = StdRng::seed_from_u64(2u64);
    let mut data: Vec<_> = (100..55_000_u64)
-        .map(|num| num + rng.r#gen::<u8>() as u64)
+        .map(|num| num + rng.random::<u8>() as u64)
        .collect();
    data.push(99_000);
    data.insert(1000, 2000);
--- a/columnar/benches/bench_optional_index.rs
+++ b/columnar/benches/bench_optional_index.rs
@@ -8,7 +8,7 @@ const TOTAL_NUM_VALUES: u32 = 1_000_000;
 fn gen_optional_index(fill_ratio: f64) -> OptionalIndex {
    let mut rng: StdRng = StdRng::from_seed([1u8; 32]);
    let vals: Vec<u32> = (0..TOTAL_NUM_VALUES)
-        .map(|_| rng.gen_bool(fill_ratio))
+        .map(|_| rng.random_bool(fill_ratio))
        .enumerate()
        .filter(|(_pos, val)| *val)
        .map(|(pos, _)| pos as u32)
@@ -25,7 +25,7 @@ fn random_range_iterator(
    let mut rng: StdRng = StdRng::from_seed([1u8; 32]);
    let mut current = start;
    std::iter::from_fn(move || {
-        current += rng.gen_range(avg_step_size - avg_deviation..=avg_step_size + avg_deviation);
+        current += rng.random_range(avg_step_size - avg_deviation..=avg_step_size + avg_deviation);
        if current >= end { None } else { Some(current) }
    })
 }
--- a/columnar/benches/bench_values_u128.rs
+++ b/columnar/benches/bench_values_u128.rs
@@ -39,7 +39,7 @@ fn get_data_50percent_item() -> Vec<u128> {

    let mut data = vec![];
    for _ in 0..300_000 {
-        let val = rng.gen_range(1..=100);
+        let val = rng.random_range(1..=100);
        data.push(val);
    }
    data.push(SINGLE_ITEM);
--- a/columnar/benches/bench_values_u64.rs
+++ b/columnar/benches/bench_values_u64.rs
@@ -34,7 +34,7 @@ fn get_data_50percent_item() -> Vec<u128> {

    let mut data = vec![];
    for _ in 0..300_000 {
-        let val = rng.gen_range(1..=100);
+        let val = rng.random_range(1..=100);
        data.push(val);
    }
    data.push(SINGLE_ITEM);
--- a/columnar/src/column_values/u64_based/linear.rs
+++ b/columnar/src/column_values/u64_based/linear.rs
@@ -268,7 +268,7 @@ mod tests {

    #[test]
    fn linear_interpol_fast_field_rand() {
-        let mut rng = rand::thread_rng();
+        let mut rng = rand::rng();
        for _ in 0..50 {
            let mut data = (0..10_000).map(|_| rng.next_u64()).collect::<Vec<_>>();
            create_and_validate::<LinearCodec>(&data, "random");
--- a/columnar/src/column_values/u64_based/tests.rs
+++ b/columnar/src/column_values/u64_based/tests.rs
@@ -122,7 +122,7 @@ pub(crate) fn create_and_validate<TColumnCodec: ColumnCodec>(
    assert_eq!(vals, buffer);

    if !vals.is_empty() {
-        let test_rand_idx = rand::thread_rng().gen_range(0..=vals.len() - 1);
+        let test_rand_idx = rand::rng().random_range(0..=vals.len() - 1);
        let expected_positions: Vec<u32> = vals
            .iter()
            .enumerate()
--- a/common/Cargo.toml
+++ b/common/Cargo.toml
@@ -21,5 +21,5 @@ serde = { version = "1.0.136", features = ["derive"] }
 [dev-dependencies]
 binggan = "0.14.0"
 proptest = "1.0.0"
-rand = "0.8.4"
+rand = "0.9"

--- a/common/benches/bench.rs
+++ b/common/benches/bench.rs
@@ -1,6 +1,6 @@
 use binggan::{BenchRunner, black_box};
+use rand::rng;
 use rand::seq::IteratorRandom;
-use rand::thread_rng;
 use tantivy_common::{BitSet, TinySet, serialize_vint_u32};

 fn bench_vint() {
@@ -17,7 +17,7 @@ fn bench_vint() {
        black_box(out);
    });

-    let vals: Vec<u32> = (0..20_000).choose_multiple(&mut thread_rng(), 100_000);
+    let vals: Vec<u32> = (0..20_000).choose_multiple(&mut rng(), 100_000);
    runner.bench_function("bench_vint_rand", move |_| {
        let mut out = 0u64;
        for val in vals.iter().cloned() {
--- a/common/src/bitset.rs
+++ b/common/src/bitset.rs
@@ -416,7 +416,7 @@ mod tests {
    use std::collections::HashSet;

    use ownedbytes::OwnedBytes;
-    use rand::distributions::Bernoulli;
+    use rand::distr::Bernoulli;
    use rand::rngs::StdRng;
    use rand::{Rng, SeedableRng};

--- a/doc/src/json.md
+++ b/doc/src/json.md
@@ -60,7 +60,7 @@ At indexing, tantivy will try to interpret number and strings as different type
 priority order.

 Numbers will be interpreted as u64, i64 and f64 in that order.
-Strings will be interpreted as rfc3999 dates or simple strings.
+Strings will be interpreted as rfc3339 dates or simple strings.

 The first working type is picked and is the only term that is emitted for indexing.
 Note this interpretation happens on a per-document basis, and there is no effort to try to sniff
@@ -81,7 +81,7 @@ Will be interpreted as
 (my_path.my_segment, String, 233) or (my_path.my_segment, u64, 233)
 ```

-Likewise, we need to emit two tokens if the query contains an rfc3999 date.
+Likewise, we need to emit two tokens if the query contains an rfc3339 date.
 Indeed the date could have been actually a single token inside the text of a document at ingestion time. Generally speaking, we will always at least emit a string token in query parsing, and sometimes more.

 If one more json field is defined, things get even more complicated.
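Review note: the "first working type wins" rule for numbers described in `json.md` can be sketched as below. This is a simplified illustration under stated assumptions: it parses from a string, whereas tantivy works on already-parsed JSON values, and `Interpreted` and `interpret_number` are hypothetical names, not tantivy APIs.

```rust
/// Hypothetical result type for the illustration.
#[derive(Debug, PartialEq)]
enum Interpreted {
    U64(u64),
    I64(i64),
    F64(f64),
}

/// Tries u64, then i64, then f64, and returns the first type that parses.
/// Illustration only; not how tantivy is implemented.
fn interpret_number(raw: &str) -> Option<Interpreted> {
    if let Ok(v) = raw.parse::<u64>() {
        return Some(Interpreted::U64(v));
    }
    if let Ok(v) = raw.parse::<i64>() {
        return Some(Interpreted::I64(v));
    }
    raw.parse::<f64>().ok().map(Interpreted::F64)
}

fn main() {
    assert_eq!(interpret_number("233"), Some(Interpreted::U64(233)));
    assert_eq!(interpret_number("-5"), Some(Interpreted::I64(-5)));
    assert_eq!(interpret_number("1.5"), Some(Interpreted::F64(1.5)));
    assert_eq!(interpret_number("abc"), None);
}
```

Because only the first matching type is emitted at indexing time, the query side has to emit one candidate term per plausible type, which is why `233` can expand to both a string and a `u64` term.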
--- a/query-grammar/src/query_grammar.rs
+++ b/query-grammar/src/query_grammar.rs
@@ -560,7 +560,7 @@ fn range_infallible(inp: &str) -> JResult<&str, UserInputLeaf> {
            (
                (
                    value((), tag(">=")),
-                    map(word_infallible("", false), |(bound, err)| {
+                    map(word_infallible(")", false), |(bound, err)| {
                        (
                            (
                                bound
@@ -574,7 +574,7 @@ fn range_infallible(inp: &str) -> JResult<&str, UserInputLeaf> {
                ),
                (
                    value((), tag("<=")),
-                    map(word_infallible("", false), |(bound, err)| {
+                    map(word_infallible(")", false), |(bound, err)| {
                        (
                            (
                                UserInputBound::Unbounded,
@@ -588,7 +588,7 @@ fn range_infallible(inp: &str) -> JResult<&str, UserInputLeaf> {
                ),
                (
                    value((), tag(">")),
-                    map(word_infallible("", false), |(bound, err)| {
+                    map(word_infallible(")", false), |(bound, err)| {
                        (
                            (
                                bound
@@ -602,7 +602,7 @@ fn range_infallible(inp: &str) -> JResult<&str, UserInputLeaf> {
                ),
                (
                    value((), tag("<")),
-                    map(word_infallible("", false), |(bound, err)| {
+                    map(word_infallible(")", false), |(bound, err)| {
                        (
                            (
                                UserInputBound::Unbounded,
@@ -704,7 +704,11 @@ fn regex(inp: &str) -> IResult<&str, UserInputLeaf> {
                many1(alt((preceded(char('\\'), char('/')), none_of("/")))),
                char('/'),
            ),
-            peek(alt((multispace1, eof))),
+            peek(alt((
+                value((), multispace1),
+                value((), char(')')),
+                value((), eof),
+            ))),
        ),
        |elements| UserInputLeaf::Regex {
            field: None,
@@ -721,8 +725,12 @@ fn regex_infallible(inp: &str) -> JResult<&str, UserInputLeaf> {
            opt_i_err(char('/'), "missing delimiter /"),
        ),
        opt_i_err(
-            peek(alt((multispace1, eof))),
-            "expected whitespace or end of input",
+            peek(alt((
+                value((), multispace1),
+                value((), char(')')),
+                value((), eof),
+            ))),
+            "expected whitespace, closing parenthesis, or end of input",
        ),
    )(inp)
    {
@@ -1323,6 +1331,14 @@ mod test {
        test_parse_query_to_ast_helper("<a", "{\"*\" TO \"a\"}");
        test_parse_query_to_ast_helper("<=a", "{\"*\" TO \"a\"]");
        test_parse_query_to_ast_helper("<=bsd", "{\"*\" TO \"bsd\"]");
+
+        test_parse_query_to_ast_helper("(<=42)", "{\"*\" TO \"42\"]");
+        test_parse_query_to_ast_helper("(<=42 )", "{\"*\" TO \"42\"]");
+        test_parse_query_to_ast_helper("(age:>5)", "\"age\":{\"5\" TO \"*\"}");
+        test_parse_query_to_ast_helper(
+            "(title:bar AND age:>12)",
+            "(+\"title\":bar +\"age\":{\"12\" TO \"*\"})",
+        );
    }

    #[test]
@@ -1699,6 +1715,10 @@ mod test {
        test_parse_query_to_ast_helper("foo:(A OR B)", "(?\"foo\":A ?\"foo\":B)");
        test_parse_query_to_ast_helper("foo:(A* OR B*)", "(?\"foo\":A* ?\"foo\":B*)");
        test_parse_query_to_ast_helper("foo:(*A OR *B)", "(?\"foo\":*A ?\"foo\":*B)");
+
+        // Regexes between parentheses
+        test_parse_query_to_ast_helper("foo:(/A.*/)", "\"foo\":/A.*/");
+        test_parse_query_to_ast_helper("foo:(/A.*/ OR /B.*/)", "(?\"foo\":/A.*/ ?\"foo\":/B.*/)");
    }

    #[test]
--- a/query-grammar/src/user_input_ast.rs
+++ b/query-grammar/src/user_input_ast.rs
@@ -66,6 +66,7 @@ impl UserInputLeaf {
            }
            UserInputLeaf::Range { field, .. } if field.is_none() => *field = Some(default_field),
            UserInputLeaf::Set { field, .. } if field.is_none() => *field = Some(default_field),
+            UserInputLeaf::Regex { field, .. } if field.is_none() => *field = Some(default_field),
            _ => (), // field was already set, do nothing
        }
    }
--- a/src/aggregation/intermediate_agg_result.rs
+++ b/src/aggregation/intermediate_agg_result.rs
@@ -90,6 +90,19 @@ impl From<IntermediateKey> for Key {

 impl Eq for IntermediateKey {}

+impl std::fmt::Display for IntermediateKey {
+    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+        match self {
+            IntermediateKey::Str(val) => f.write_str(val),
+            IntermediateKey::F64(val) => write!(f, "{val}"),
+            IntermediateKey::U64(val) => write!(f, "{val}"),
+            IntermediateKey::I64(val) => write!(f, "{val}"),
+            IntermediateKey::Bool(val) => write!(f, "{val}"),
+            IntermediateKey::IpAddr(val) => write!(f, "{val}"),
+        }
+    }
+}
+
 impl std::hash::Hash for IntermediateKey {
    fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
        core::mem::discriminant(self).hash(state);
@@ -105,6 +118,21 @@ impl std::hash::Hash for IntermediateKey {
 }

 impl IntermediateAggregationResults {
+    /// Returns a reference to the intermediate aggregation result for the given key.
+    pub fn get(&self, key: &str) -> Option<&IntermediateAggregationResult> {
+        self.aggs_res.get(key)
+    }
+
+    /// Removes and returns the intermediate aggregation result for the given key.
+    pub fn remove(&mut self, key: &str) -> Option<IntermediateAggregationResult> {
+        self.aggs_res.remove(key)
+    }
+
+    /// Returns an iterator over the keys in the intermediate aggregation results.
+    pub fn keys(&self) -> impl Iterator<Item = &String> {
+        self.aggs_res.keys()
+    }
+
    /// Add a result
    pub fn push(&mut self, key: String, value: IntermediateAggregationResult) -> crate::Result<()> {
        let entry = self.aggs_res.entry(key);
@@ -639,6 +667,21 @@ pub struct IntermediateTermBucketResult {
 }

 impl IntermediateTermBucketResult {
+    /// Returns a reference to the map of bucket entries keyed by [`IntermediateKey`].
+    pub fn entries(&self) -> &FxHashMap<IntermediateKey, IntermediateTermBucketEntry> {
+        &self.entries
+    }
+
+    /// Returns the count of documents not included in the returned buckets.
+    pub fn sum_other_doc_count(&self) -> u64 {
+        self.sum_other_doc_count
+    }
+
+    /// Returns the upper bound of the error on document counts in the returned buckets.
+    pub fn doc_count_error_upper_bound(&self) -> u64 {
+        self.doc_count_error_upper_bound
+    }
+
    pub(crate) fn into_final_result(
        self,
        req: &TermsAggregation,
@@ -820,7 +863,7 @@ impl IntermediateRangeBucketEntry {
        };

        // If we have a date type on the histogram buckets, we add the `key_as_string` field as
-        // rfc339
+        // rfc3339
        if column_type == Some(ColumnType::DateTime) {
            if let Some(val) = range_bucket_entry.to {
                let key_as_string = format_date(val as i64)?;
--- a/src/aggregation/metric/average.rs
+++ b/src/aggregation/metric/average.rs
@@ -55,6 +55,12 @@ impl IntermediateAverage {
    pub(crate) fn from_stats(stats: IntermediateStats) -> Self {
        Self { stats }
    }
+
+    /// Returns a reference to the underlying [`IntermediateStats`].
+    pub fn stats(&self) -> &IntermediateStats {
+        &self.stats
+    }
+
    /// Merges the other intermediate result into self.
    pub fn merge_fruits(&mut self, other: IntermediateAverage) {
        self.stats.merge_fruits(other.stats);
--- a/src/aggregation/metric/cardinality.rs
+++ b/src/aggregation/metric/cardinality.rs
@@ -1,12 +1,11 @@
-use std::collections::hash_map::DefaultHasher;
-use std::hash::{BuildHasher, Hasher};
+use std::hash::Hash;

 use columnar::column_values::CompactSpaceU64Accessor;
 use columnar::{Column, ColumnType, Dictionary, StrColumn};
 use common::f64_to_u64;
-use hyperloglogplus::{HyperLogLog, HyperLogLogPlus};
+use datasketches::hll::{HllSketch, HllType, HllUnion};
 use rustc_hash::FxHashSet;
-use serde::{Deserialize, Serialize};
+use serde::{Deserialize, Deserializer, Serialize, Serializer};

 use crate::aggregation::agg_data::AggregationsSegmentCtx;
 use crate::aggregation::intermediate_agg_result::{
@@ -16,29 +15,17 @@ use crate::aggregation::segment_agg_result::SegmentAggregationCollector;
 use crate::aggregation::*;
 use crate::TantivyError;

-#[derive(Clone, Debug, Serialize, Deserialize)]
-struct BuildSaltedHasher {
-    salt: u8,
-}
-
-impl BuildHasher for BuildSaltedHasher {
-    type Hasher = DefaultHasher;
-
-    fn build_hasher(&self) -> Self::Hasher {
-        let mut hasher = DefaultHasher::new();
-        hasher.write_u8(self.salt);
-
-        hasher
-    }
-}
+/// Log2 of the number of registers. Must match the Java `Union(LOG2M)` where LOG2M=11.
+/// 2^11 = 2048 registers.
+const LG_K: u8 = 11;

 /// # Cardinality
 ///
 /// The cardinality aggregation allows for computing an estimate
 /// of the number of different values in a data set based on the
-/// HyperLogLog++ algorithm. This is particularly useful for understanding the
-/// uniqueness of values in a large dataset where counting each unique value
-/// individually would be computationally expensive.
+/// Apache DataSketches HyperLogLog algorithm. This is particularly useful for
+/// understanding the uniqueness of values in a large dataset where counting
+/// each unique value individually would be computationally expensive.
 ///
 /// For example, you might use a cardinality aggregation to estimate the number
 /// of unique visitors to a website by aggregating on a field that contains
@@ -184,7 +171,7 @@ impl SegmentCardinalityCollectorBucket {

            term_ids.sort_unstable();
            dict.sorted_ords_to_term_cb(term_ids.iter().map(|term| *term as u64), |term| {
-                self.cardinality.sketch.insert_any(&term);
+                self.cardinality.insert(term);
                Ok(())
            })?;
            if has_missing {
@@ -195,17 +182,17 @@ impl SegmentCardinalityCollectorBucket {
                    );
                match missing_key {
                    Key::Str(missing) => {
-                        self.cardinality.sketch.insert_any(&missing);
+                        self.cardinality.insert(missing.as_str());
                    }
                    Key::F64(val) => {
                        let val = f64_to_u64(*val);
-                        self.cardinality.sketch.insert_any(&val);
+                        self.cardinality.insert(val);
                    }
                    Key::U64(val) => {
-                        self.cardinality.sketch.insert_any(&val);
+                        self.cardinality.insert(*val);
                    }
                    Key::I64(val) => {
-                        self.cardinality.sketch.insert_any(&val);
+                        self.cardinality.insert(*val);
                    }
                }
            }
@@ -296,11 +283,11 @@ impl SegmentAggregationCollector for SegmentCardinalityCollector {
                })?;
            for val in col_block_accessor.iter_vals() {
                let val: u128 = compact_space_accessor.compact_to_u128(val as u32);
-                bucket.cardinality.sketch.insert_any(&val);
+                bucket.cardinality.insert(val);
            }
        } else {
            for val in col_block_accessor.iter_vals() {
-                bucket.cardinality.sketch.insert_any(&val);
+                bucket.cardinality.insert(val);
            }
        }

@@ -321,11 +308,17 @@ impl SegmentAggregationCollector for SegmentCardinalityCollector {
    }
 }

-#[derive(Clone, Debug, Serialize, Deserialize)]
-/// The percentiles collector used during segment collection and for merging results.
+/// The cardinality collector used during segment collection and for merging results.
+/// Uses Apache DataSketches HLL (lg_k=11) for compatibility with Datadog's event query.
+#[derive(Clone, Debug)]
 pub struct CardinalityCollector {
-    sketch: HyperLogLogPlus<u64, BuildSaltedHasher>,
+    sketch: HllSketch,
+    /// Salt derived from `ColumnType`, used to differentiate values of different column types
+    /// that map to the same u64 (e.g. bool `false` = 0 vs i64 `0`).
+    /// Not serialized: only needed during insertion; deserialized collectors get salt 0.
+    salt: u8,
 }
+
 impl Default for CardinalityCollector {
    fn default() -> Self {
        Self::new(0)
@@ -338,25 +331,52 @@ impl PartialEq for CardinalityCollector {
    }
 }

-impl CardinalityCollector {
-    /// Compute the final cardinality estimate.
-    pub fn finalize(self) -> Option<f64> {
-        Some(self.sketch.clone().count().trunc())
+impl Serialize for CardinalityCollector {
+    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
+        let bytes = self.sketch.serialize();
+        serializer.serialize_bytes(&bytes)
    }
+}

+impl<'de> Deserialize<'de> for CardinalityCollector {
+    fn deserialize<D: Deserializer<'de>>(deserializer: D) -> Result<Self, D::Error> {
+        let bytes: Vec<u8> = Deserialize::deserialize(deserializer)?;
+        let sketch = HllSketch::deserialize(&bytes).map_err(serde::de::Error::custom)?;
+        Ok(Self { sketch, salt: 0 })
+    }
+}
+
+impl CardinalityCollector {
    fn new(salt: u8) -> Self {
        Self {
-            sketch: HyperLogLogPlus::new(16, BuildSaltedHasher { salt }).unwrap(),
+            sketch: HllSketch::new(LG_K, HllType::Hll4),
+            salt,
        }
    }

-    pub(crate) fn merge_fruits(&mut self, right: CardinalityCollector) -> crate::Result<()> {
-        self.sketch.merge(&right.sketch).map_err(|err| {
-            TantivyError::AggregationError(AggregationError::InternalError(format!(
-                "Error while merging cardinality {err:?}"
-            )))
-        })?;
+    /// Insert a value into the HLL sketch, salted by the column type.
+    /// The salt ensures that identical u64 values from different column types
+    /// (e.g. bool `false` vs i64 `0`) are counted as distinct.
+    pub(crate) fn insert<T: Hash>(&mut self, value: T) {
+        self.sketch.update((self.salt, value));
+    }

+    /// Compute the final cardinality estimate.
+    pub fn finalize(self) -> Option<f64> {
+        Some(self.sketch.estimate().trunc())
+    }
+
+    /// Serialize the HLL sketch to its compact binary representation.
+    /// This format is compatible with Apache DataSketches Java (`HllSketch.heapify()`).
+    pub fn to_sketch_bytes(&self) -> Vec<u8> {
+        self.sketch.serialize()
+    }
+
+    pub(crate) fn merge_fruits(&mut self, right: CardinalityCollector) -> crate::Result<()> {
+        let mut union = HllUnion::new(LG_K);
+        union.update(&self.sketch);
+        union.update(&right.sketch);
+        self.sketch = union.get_result(HllType::Hll4);
        Ok(())
    }
 }
@@ -518,4 +538,75 @@ mod tests {

        Ok(())
    }
+
+    #[test]
+    fn cardinality_collector_serde_roundtrip() {
+        use super::CardinalityCollector;
+
+        let mut collector = CardinalityCollector::default();
+        collector.insert("hello");
+        collector.insert("world");
+        collector.insert("hello"); // duplicate
+
+        let serialized = serde_json::to_vec(&collector).unwrap();
+        let deserialized: CardinalityCollector = serde_json::from_slice(&serialized).unwrap();
+
+        let original_estimate = collector.finalize().unwrap();
+        let roundtrip_estimate = deserialized.finalize().unwrap();
+        assert_eq!(original_estimate, roundtrip_estimate);
+        assert_eq!(original_estimate, 2.0);
+    }
+
+    #[test]
+    fn cardinality_collector_merge() {
+        use super::CardinalityCollector;
+
+        let mut left = CardinalityCollector::default();
+        left.insert("a");
+        left.insert("b");
+
+        let mut right = CardinalityCollector::default();
+        right.insert("b");
+        right.insert("c");
+
+        left.merge_fruits(right).unwrap();
+        let estimate = left.finalize().unwrap();
+        assert_eq!(estimate, 3.0);
+    }
+
+    #[test]
+    fn cardinality_collector_serialize_deserialize_binary() {
+        use datasketches::hll::HllSketch;
+
+        use super::CardinalityCollector;
+
+        let mut collector = CardinalityCollector::default();
+        collector.insert("apple");
+        collector.insert("banana");
+        collector.insert("cherry");
+
+        let bytes = collector.to_sketch_bytes();
+        let deserialized = HllSketch::deserialize(&bytes).unwrap();
+        assert!((deserialized.estimate() - 3.0).abs() < 0.01);
+    }
+
+    #[test]
+    fn cardinality_collector_salt_differentiates_types() {
+        use super::CardinalityCollector;
+
+        // Without the salt, the same u64 value from different column types would collide.
+        let mut collector_bool = CardinalityCollector::new(5); // e.g. ColumnType::Bool
+        collector_bool.insert(0u64); // false
+        collector_bool.insert(1u64); // true
+
+        let mut collector_i64 = CardinalityCollector::new(2); // e.g. ColumnType::I64
+        collector_i64.insert(0u64);
+        collector_i64.insert(1u64);
+
+        // Merge them
+        collector_bool.merge_fruits(collector_i64).unwrap();
+        let estimate = collector_bool.finalize().unwrap();
+        // Should be 4 because salt makes (5, 0) != (2, 0) and (5, 1) != (2, 1)
+        assert_eq!(estimate, 4.0);
+    }
 }
--- a/src/aggregation/metric/mod.rs
+++ b/src/aggregation/metric/mod.rs
@@ -107,8 +107,11 @@ pub enum PercentileValues {
 #[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
 /// The entry when requesting percentiles with keyed: false
 pub struct PercentileValuesVecEntry {
-    key: f64,
-    value: f64,
+    /// Percentile
+    pub key: f64,
+
+    /// Value at the percentile
+    pub value: f64,
 }

 /// Single-metric aggregations use this common result structure.
--- a/src/aggregation/metric/stats.rs
+++ b/src/aggregation/metric/stats.rs
@@ -110,6 +110,16 @@ impl Default for IntermediateStats {
 }

 impl IntermediateStats {
+    /// Returns the number of values collected.
+    pub fn count(&self) -> u64 {
+        self.count
+    }
+
+    /// Returns the sum of all values collected.
+    pub fn sum(&self) -> f64 {
+        self.sum
+    }
+
    /// Merges the other stats intermediate result into self.
    pub fn merge_fruits(&mut self, other: IntermediateStats) {
        self.count += other.count;
--- a/src/collector/facet_collector.rs
+++ b/src/collector/facet_collector.rs
@@ -486,9 +486,9 @@ mod tests {
    use std::collections::BTreeSet;

    use columnar::Dictionary;
-    use rand::distributions::Uniform;
+    use rand::distr::Uniform;
    use rand::prelude::SliceRandom;
-    use rand::{thread_rng, Rng};
+    use rand::{rng, Rng};

    use super::{FacetCollector, FacetCounts};
    use crate::collector::facet_collector::compress_mapping;
@@ -731,7 +731,7 @@ mod tests {
        let schema = schema_builder.build();
        let index = Index::create_in_ram(schema);

-        let uniform = Uniform::new_inclusive(1, 100_000);
+        let uniform = Uniform::new_inclusive(1, 100_000).unwrap();
        let mut docs: Vec<TantivyDocument> =
            vec![("a", 10), ("b", 100), ("c", 7), ("d", 12), ("e", 21)]
                .into_iter()
@@ -741,14 +741,11 @@ mod tests {
                    std::iter::repeat_n(doc, count)
                })
                .map(|mut doc| {
-                    doc.add_facet(
-                        facet_field,
-                        &format!("/facet/{}", thread_rng().sample(uniform)),
-                    );
+                    doc.add_facet(facet_field, &format!("/facet/{}", rng().sample(uniform)));
                    doc
                })
                .collect();
-        docs[..].shuffle(&mut thread_rng());
+        docs[..].shuffle(&mut rng());

        let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
        for doc in docs {
@@ -822,8 +819,8 @@ mod tests {
 #[cfg(all(test, feature = "unstable"))]
 mod bench {

+    use rand::rng;
    use rand::seq::SliceRandom;
-    use rand::thread_rng;
    use test::Bencher;

    use crate::collector::FacetCollector;
@@ -846,7 +843,7 @@ mod bench {
            }
        }
        // 40425 docs
-        docs[..].shuffle(&mut thread_rng());
+        docs[..].shuffle(&mut rng());

        let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
        for doc in docs {
--- a/src/collector/sort_key/mod.rs
+++ b/src/collector/sort_key/mod.rs
@@ -1,4 +1,5 @@
 mod order;
+mod sort_by_bytes;
 mod sort_by_erased_type;
 mod sort_by_score;
 mod sort_by_static_fast_value;
@@ -6,6 +7,7 @@ mod sort_by_string;
 mod sort_key_computer;

 pub use order::*;
+pub use sort_by_bytes::SortByBytes;
 pub use sort_by_erased_type::SortByErasedType;
 pub use sort_by_score::SortBySimilarityScore;
 pub use sort_by_static_fast_value::SortByStaticFastValue;
--- a/src/collector/sort_key/sort_by_bytes.rs
+++ b/src/collector/sort_key/sort_by_bytes.rs
@@ -0,0 +1,168 @@
+use columnar::BytesColumn;
+
+use crate::collector::sort_key::NaturalComparator;
+use crate::collector::{SegmentSortKeyComputer, SortKeyComputer};
+use crate::termdict::TermOrdinal;
+use crate::{DocId, Score};
+
+/// Sort by the first value of a bytes column.
+///
+/// If the field is multivalued, only the first value is considered.
+///
+/// Documents that do not have this value are still considered.
+/// Their sort key will simply be `None`.
+#[derive(Debug, Clone)]
+pub struct SortByBytes {
+    column_name: String,
+}
+
+impl SortByBytes {
+    /// Creates a new sort by bytes sort key computer.
+    pub fn for_field(column_name: impl ToString) -> Self {
+        SortByBytes {
+            column_name: column_name.to_string(),
+        }
+    }
+}
+
+impl SortKeyComputer for SortByBytes {
+    type SortKey = Option<Vec<u8>>;
+    type Child = ByBytesColumnSegmentSortKeyComputer;
+    type Comparator = NaturalComparator;
+
+    fn segment_sort_key_computer(
+        &self,
+        segment_reader: &crate::SegmentReader,
+    ) -> crate::Result<Self::Child> {
+        let bytes_column_opt = segment_reader.fast_fields().bytes(&self.column_name)?;
+        Ok(ByBytesColumnSegmentSortKeyComputer { bytes_column_opt })
+    }
+}
+
+/// Segment-level sort key computer for bytes columns.
+pub struct ByBytesColumnSegmentSortKeyComputer {
+    bytes_column_opt: Option<BytesColumn>,
+}
+
+impl SegmentSortKeyComputer for ByBytesColumnSegmentSortKeyComputer {
+    type SortKey = Option<Vec<u8>>;
+    type SegmentSortKey = Option<TermOrdinal>;
+    type SegmentComparator = NaturalComparator;
+
+    #[inline(always)]
+    fn segment_sort_key(&mut self, doc: DocId, _score: Score) -> Option<TermOrdinal> {
+        let bytes_column = self.bytes_column_opt.as_ref()?;
+        bytes_column.ords().first(doc)
+    }
+
+    fn convert_segment_sort_key(&self, term_ord_opt: Option<TermOrdinal>) -> Option<Vec<u8>> {
+        // TODO: Individual lookups to the dictionary like this are very likely to repeatedly
+        // decompress the same blocks. See https://github.com/quickwit-oss/tantivy/issues/2776
+        let term_ord = term_ord_opt?;
+        let bytes_column = self.bytes_column_opt.as_ref()?;
+        let mut bytes = Vec::new();
+        bytes_column
+            .dictionary()
+            .ord_to_term(term_ord, &mut bytes)
+            .ok()?;
+        Some(bytes)
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use super::SortByBytes;
+    use crate::collector::TopDocs;
+    use crate::query::AllQuery;
+    use crate::schema::{BytesOptions, Schema, FAST, INDEXED};
+    use crate::{Index, IndexWriter, Order, TantivyDocument};
+
+    #[test]
+    fn test_sort_by_bytes_asc() -> crate::Result<()> {
+        let mut schema_builder = Schema::builder();
+        let bytes_field = schema_builder
+            .add_bytes_field("data", BytesOptions::default().set_fast().set_indexed());
+        let id_field = schema_builder.add_u64_field("id", FAST | INDEXED);
+        let schema = schema_builder.build();
+        let index = Index::create_in_ram(schema);
+        let mut index_writer: IndexWriter = index.writer_for_tests()?;
+
+        // Insert documents with byte values in non-sorted order
+        let test_data: Vec<(u64, Vec<u8>)> = vec![
+            (1, vec![0x02, 0x00]),
+            (2, vec![0x00, 0x10]),
+            (3, vec![0x01, 0x00]),
+            (4, vec![0x00, 0x20]),
+        ];
+
+        for (id, bytes) in &test_data {
+            let mut doc = TantivyDocument::new();
+            doc.add_u64(id_field, *id);
+            doc.add_bytes(bytes_field, bytes);
+            index_writer.add_document(doc)?;
+        }
+        index_writer.commit()?;
+
+        let reader = index.reader()?;
+        let searcher = reader.searcher();
+
+        // Sort ascending by bytes
+        let top_docs =
+            TopDocs::with_limit(10).order_by((SortByBytes::for_field("data"), Order::Asc));
+        let results: Vec<(Option<Vec<u8>>, _)> = searcher.search(&AllQuery, &top_docs)?;
+
+        // Expected order: [0x00,0x10], [0x00,0x20], [0x01,0x00], [0x02,0x00]
+        let sorted_bytes: Vec<Option<Vec<u8>>> = results.into_iter().map(|(b, _)| b).collect();
+        assert_eq!(
+            sorted_bytes,
+            vec![
+                Some(vec![0x00, 0x10]),
+                Some(vec![0x00, 0x20]),
+                Some(vec![0x01, 0x00]),
+                Some(vec![0x02, 0x00]),
+            ]
+        );
+
+        Ok(())
+    }
+
+    #[test]
+    fn test_sort_by_bytes_desc() -> crate::Result<()> {
+        let mut schema_builder = Schema::builder();
+        let bytes_field = schema_builder
+            .add_bytes_field("data", BytesOptions::default().set_fast().set_indexed());
+        let schema = schema_builder.build();
+        let index = Index::create_in_ram(schema);
+        let mut index_writer: IndexWriter = index.writer_for_tests()?;
+
+        let test_data: Vec<Vec<u8>> = vec![vec![0x00, 0x10], vec![0x02, 0x00], vec![0x01, 0x00]];
+
+        for bytes in &test_data {
+            let mut doc = TantivyDocument::new();
+            doc.add_bytes(bytes_field, bytes);
+            index_writer.add_document(doc)?;
+        }
+        index_writer.commit()?;
+
+        let reader = index.reader()?;
+        let searcher = reader.searcher();
+
+        // Sort descending by bytes
+        let top_docs =
+            TopDocs::with_limit(10).order_by((SortByBytes::for_field("data"), Order::Desc));
+        let results: Vec<(Option<Vec<u8>>, _)> = searcher.search(&AllQuery, &top_docs)?;
+
+        // Expected order (descending): [0x02,0x00], [0x01,0x00], [0x00,0x10]
+        let sorted_bytes: Vec<Option<Vec<u8>>> = results.into_iter().map(|(b, _)| b).collect();
+        assert_eq!(
+            sorted_bytes,
+            vec![
+                Some(vec![0x02, 0x00]),
+                Some(vec![0x01, 0x00]),
+                Some(vec![0x00, 0x10]),
+            ]
+        );
+
+        Ok(())
+    }
+}
--- a/src/collector/sort_key/sort_by_erased_type.rs
+++ b/src/collector/sort_key/sort_by_erased_type.rs
@@ -1,7 +1,7 @@
 use columnar::{ColumnType, MonotonicallyMappableToU64};

 use crate::collector::sort_key::{
-    NaturalComparator, SortBySimilarityScore, SortByStaticFastValue, SortByString,
+    NaturalComparator, SortByBytes, SortBySimilarityScore, SortByStaticFastValue, SortByString,
 };
 use crate::collector::{SegmentSortKeyComputer, SortKeyComputer};
 use crate::fastfield::FastFieldNotAvailableError;
@@ -114,6 +114,16 @@ impl SortKeyComputer for SortByErasedType {
                            },
                        })
                    }
+                    ColumnType::Bytes => {
+                        let computer = SortByBytes::for_field(column_name);
+                        let inner = computer.segment_sort_key_computer(segment_reader)?;
+                        Box::new(ErasedSegmentSortKeyComputerWrapper {
+                            inner,
+                            converter: |val: Option<Vec<u8>>| {
+                                val.map(OwnedValue::Bytes).unwrap_or(OwnedValue::Null)
+                            },
+                        })
+                    }
                    ColumnType::U64 => {
                        let computer = SortByStaticFastValue::<u64>::for_field(column_name);
                        let inner = computer.segment_sort_key_computer(segment_reader)?;
@@ -281,6 +291,65 @@ mod tests {
        );
    }

+    #[test]
+    fn test_sort_by_owned_bytes() {
+        let mut schema_builder = Schema::builder();
+        let data_field = schema_builder.add_bytes_field("data", FAST);
+        let schema = schema_builder.build();
+        let index = Index::create_in_ram(schema);
+        let mut writer = index.writer_for_tests().unwrap();
+        writer
+            .add_document(doc!(data_field => vec![0x03u8, 0x00]))
+            .unwrap();
+        writer
+            .add_document(doc!(data_field => vec![0x01u8, 0x00]))
+            .unwrap();
+        writer
+            .add_document(doc!(data_field => vec![0x02u8, 0x00]))
+            .unwrap();
+        writer.add_document(doc!()).unwrap();
+        writer.commit().unwrap();
+
+        let reader = index.reader().unwrap();
+        let searcher = reader.searcher();
+
+        // Sort descending (Natural - highest first)
+        let collector = TopDocs::with_limit(10)
+            .order_by((SortByErasedType::for_field("data"), ComparatorEnum::Natural));
+        let top_docs = searcher.search(&AllQuery, &collector).unwrap();
+
+        let values: Vec<OwnedValue> = top_docs.into_iter().map(|(key, _)| key).collect();
+
+        assert_eq!(
+            values,
+            vec![
+                OwnedValue::Bytes(vec![0x03, 0x00]),
+                OwnedValue::Bytes(vec![0x02, 0x00]),
+                OwnedValue::Bytes(vec![0x01, 0x00]),
+                OwnedValue::Null
+            ]
+        );
+
+        // Sort ascending (ReverseNoneLower - lowest first, nulls last)
+        let collector = TopDocs::with_limit(10).order_by((
+            SortByErasedType::for_field("data"),
+            ComparatorEnum::ReverseNoneLower,
+        ));
+        let top_docs = searcher.search(&AllQuery, &collector).unwrap();
+
+        let values: Vec<OwnedValue> = top_docs.into_iter().map(|(key, _)| key).collect();
+
+        assert_eq!(
+            values,
+            vec![
+                OwnedValue::Bytes(vec![0x01, 0x00]),
+                OwnedValue::Bytes(vec![0x02, 0x00]),
+                OwnedValue::Bytes(vec![0x03, 0x00]),
+                OwnedValue::Null
+            ]
+        );
+    }
+
    #[test]
    fn test_sort_by_owned_reverse() {
        let mut schema_builder = Schema::builder();
--- a/src/collector/sort_key_top_collector.rs
+++ b/src/collector/sort_key_top_collector.rs
@@ -160,7 +160,7 @@ mod tests {
        expected: &[(crate::Score, usize)],
    ) {
        let mut vals: Vec<(crate::Score, usize)> = (0..10).map(|val| (val as f32, val)).collect();
-        vals.shuffle(&mut rand::thread_rng());
+        vals.shuffle(&mut rand::rng());
        let vals_merged = merge_top_k(vals.into_iter(), doc_range, ComparatorEnum::from(order));
        assert_eq!(&vals_merged, expected);
    }
--- a/src/directory/mmap_directory/mod.rs
+++ b/src/directory/mmap_directory/mod.rs
@@ -676,7 +676,7 @@ mod tests {
            let num_segments = reader.searcher().segment_readers().len();
            assert!(num_segments <= 4);
            let num_components_except_deletes_and_tempstore =
-                crate::index::SegmentComponent::iterator().len() - 2;
+                crate::index::SegmentComponent::iterator().len() - 1;
            let max_num_mmapped = num_components_except_deletes_and_tempstore * num_segments;
            assert_eventually(|| {
                let num_mmapped = mmap_directory.get_cache_info().mmapped.len();
--- a/src/docset.rs
+++ b/src/docset.rs
@@ -51,31 +51,55 @@ pub trait DocSet: Send {
        doc
    }

-    /// Seeks to the target if possible and returns true if the target is in the DocSet.
+    /// !!!Dragons ahead!!!
+    /// In spirit, this is an approximate and dangerous version of `seek`.
+    ///
+    /// It can leave the DocSet in an invalid state and might return a
+    /// lower bound of what the result of `seek` would have been.
+    ///
+    ///
+    /// More accurately it returns either:
+    /// - Found if the target is in the docset. In that case, the DocSet is left in a valid state.
+    /// - SeekLowerBound(seek_lower_bound) if the target is not in the docset. In that case, the
+    ///   DocSet may be left in an invalid state. The DocSet should then only receive calls to
+    ///   `seek_danger(..)` until it returns `Found` and is back in a valid state.
+    ///
+    /// `seek_lower_bound` can be any `DocId` (in the docset or not) as long as it is in
+    /// `(target .. seek_result] U {TERMINATED}` where `seek_result` is the first document in the
+    /// docset greater than `target`.
+    ///
+    /// `seek_danger` may return `SeekLowerBound(TERMINATED)`.
+    ///
+    /// Calling `seek_danger` with TERMINATED as a target is allowed,
+    /// and should always return `SeekLowerBound(TERMINATED)` or anything larger, as TERMINATED is
+    /// NOT in the DocSet.
    ///
    /// DocSets that already have an efficient `seek` method don't need to implement
-    /// `seek_into_the_danger_zone`. All wrapper DocSets should forward
-    /// `seek_into_the_danger_zone` to the underlying DocSet.
+    /// `seek_danger`.
    ///
-    /// ## API Behaviour
-    /// If `seek_into_the_danger_zone` is returning true, a call to `doc()` has to return target.
-    /// If `seek_into_the_danger_zone` is returning false, a call to `doc()` may return any doc
-    /// between the last doc that matched and target or a doc that is a valid next hit after
-    /// target. The DocSet is considered to be in an invalid state until
-    /// `seek_into_the_danger_zone` returns true again.
-    ///
-    /// `target` needs to be equal or larger than `doc` when in a valid state.
-    ///
-    /// Consecutive calls are not allowed to have decreasing `target` values.
-    ///
-    /// # Warning
-    /// This is an advanced API used by intersection. The API contract is tricky, avoid using it.
-    fn seek_into_the_danger_zone(&mut self, target: DocId) -> bool {
-        let current_doc = self.doc();
-        if current_doc < target {
-            self.seek(target);
+    /// Consecutive calls to seek_danger are guaranteed to have strictly increasing `target`
+    /// values.
+    fn seek_danger(&mut self, target: DocId) -> SeekDangerResult {
+        if target >= TERMINATED {
+            debug_assert!(target == TERMINATED);
+            // No need to advance.
+            return SeekDangerResult::SeekLowerBound(target);
+        }
+
+        // The default implementation does not include any
+        // `danger zone` behavior.
+        //
+        // It does not leave the scorer in an invalid state.
+        // For this reason, we can safely call `self.doc()`.
+        let mut doc = self.doc();
+        if doc < target {
+            doc = self.seek(target);
+        }
+        if doc == target {
+            SeekDangerResult::Found
+        } else {
+            SeekDangerResult::SeekLowerBound(doc)
        }
-        self.doc() == target
    }

    /// Fills a given mutable buffer with the next doc ids from the
@@ -166,6 +190,17 @@ pub trait DocSet: Send {
    }
 }

+#[derive(Clone, Copy, Debug, PartialEq, Eq)]
+pub enum SeekDangerResult {
+    /// The target was found in the DocSet.
+    Found,
+    /// The target was not found in the DocSet.
+    /// We return a lower bound on where the next document could be.
+    /// The returned value can be any DocId that is greater than the target and <= the first
+    /// document in the docset after the target (or TERMINATED).
+    SeekLowerBound(DocId),
+}
+
 impl DocSet for &mut dyn DocSet {
    fn advance(&mut self) -> u32 {
        (**self).advance()
@@ -175,8 +210,8 @@ impl DocSet for &mut dyn DocSet {
        (**self).seek(target)
    }

-    fn seek_into_the_danger_zone(&mut self, target: DocId) -> bool {
-        (**self).seek_into_the_danger_zone(target)
+    fn seek_danger(&mut self, target: DocId) -> SeekDangerResult {
+        (**self).seek_danger(target)
    }

    fn doc(&self) -> u32 {
@@ -211,9 +246,9 @@ impl<TDocSet: DocSet + ?Sized> DocSet for Box<TDocSet> {
        unboxed.seek(target)
    }

-    fn seek_into_the_danger_zone(&mut self, target: DocId) -> bool {
+    fn seek_danger(&mut self, target: DocId) -> SeekDangerResult {
        let unboxed: &mut TDocSet = self.borrow_mut();
-        unboxed.seek_into_the_danger_zone(target)
+        unboxed.seek_danger(target)
    }

    fn fill_buffer(&mut self, buffer: &mut [DocId; COLLECT_BLOCK_BUFFER_LEN]) -> usize {
--- a/src/fastfield/alive_bitset.rs
+++ b/src/fastfield/alive_bitset.rs
@@ -162,7 +162,7 @@ mod tests {
 mod bench {

    use rand::prelude::IteratorRandom;
-    use rand::thread_rng;
+    use rand::rng;
    use test::Bencher;

    use super::AliveBitSet;
@@ -176,7 +176,7 @@ mod bench {
    }

    fn remove_rand(raw: &mut Vec<u32>) {
-        let i = (0..raw.len()).choose(&mut thread_rng()).unwrap();
+        let i = (0..raw.len()).choose(&mut rng()).unwrap();
        raw.remove(i);
    }

--- a/src/fastfield/mod.rs
+++ b/src/fastfield/mod.rs
@@ -879,7 +879,7 @@ mod tests {
        const ONE_HOUR_IN_MICROSECS: i64 = 3_600 * 1_000_000;
        let times: Vec<DateTime> = std::iter::repeat_with(|| {
            // +- One hour.
-            let t = T0 + rng.gen_range(-ONE_HOUR_IN_MICROSECS..ONE_HOUR_IN_MICROSECS);
+            let t = T0 + rng.random_range(-ONE_HOUR_IN_MICROSECS..ONE_HOUR_IN_MICROSECS);
            DateTime::from_timestamp_micros(t)
        })
        .take(1_000)
--- a/src/functional_test.rs
+++ b/src/functional_test.rs
@@ -1,6 +1,6 @@
 use std::collections::HashSet;

-use rand::{thread_rng, Rng};
+use rand::{rng, Rng};

 use crate::indexer::index_writer::MEMORY_BUDGET_NUM_BYTES_MIN;
 use crate::schema::*;
@@ -29,7 +29,7 @@ fn test_functional_store() -> crate::Result<()> {
    let index = Index::create_in_ram(schema);
    let reader = index.reader()?;

-    let mut rng = thread_rng();
+    let mut rng = rng();

    let mut index_writer: IndexWriter =
        index.writer_with_num_threads(3, 3 * MEMORY_BUDGET_NUM_BYTES_MIN)?;
@@ -38,9 +38,9 @@ fn test_functional_store() -> crate::Result<()> {

    let mut doc_id = 0u64;
    for _iteration in 0..get_num_iterations() {
-        let num_docs: usize = rng.gen_range(0..4);
+        let num_docs: usize = rng.random_range(0..4);
        if !doc_set.is_empty() {
-            let doc_to_remove_id = rng.gen_range(0..doc_set.len());
+            let doc_to_remove_id = rng.random_range(0..doc_set.len());
            let removed_doc_id = doc_set.swap_remove(doc_to_remove_id);
            index_writer.delete_term(Term::from_field_u64(id_field, removed_doc_id));
        }
@@ -70,10 +70,10 @@ const LOREM: &str = "Doc Lorem ipsum dolor sit amet, consectetur adipiscing elit
                     cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat \
                     non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.";
 fn get_text() -> String {
-    use rand::seq::SliceRandom;
-    let mut rng = thread_rng();
+    use rand::seq::IndexedRandom;
+    let mut rng = rng();
    let tokens: Vec<_> = LOREM.split(' ').collect();
-    let random_val = rng.gen_range(0..20);
+    let random_val = rng.random_range(0..20);

    (0..random_val)
        .map(|_| tokens.choose(&mut rng).unwrap())
@@ -101,7 +101,7 @@ fn test_functional_indexing_unsorted() -> crate::Result<()> {
    let index = Index::create_from_tempdir(schema)?;
    let reader = index.reader()?;

-    let mut rng = thread_rng();
+    let mut rng = rng();

    let mut index_writer: IndexWriter =
        index.writer_with_num_threads(3, 3 * MEMORY_BUDGET_NUM_BYTES_MIN)?;
@@ -110,7 +110,7 @@ fn test_functional_indexing_unsorted() -> crate::Result<()> {
    let mut uncommitted_docs: HashSet<u64> = HashSet::new();

    for _ in 0..get_num_iterations() {
-        let random_val = rng.gen_range(0..20);
+        let random_val = rng.random_range(0..20);
        if random_val == 0 {
            index_writer.commit()?;
            committed_docs.extend(&uncommitted_docs);
--- a/src/index/index_meta.rs
+++ b/src/index/index_meta.rs
@@ -1,8 +1,6 @@
 use std::collections::HashSet;
 use std::fmt;
 use std::path::PathBuf;
-use std::sync::atomic::AtomicBool;
-use std::sync::Arc;

 use serde::{Deserialize, Serialize};

@@ -37,7 +35,6 @@ impl SegmentMetaInventory {
        let inner = InnerSegmentMeta {
            segment_id,
            max_doc,
-            include_temp_doc_store: Arc::new(AtomicBool::new(true)),
            deletes: None,
        };
        SegmentMeta::from(self.inventory.track(inner))
@@ -85,15 +82,6 @@ impl SegmentMeta {
        self.tracked.segment_id
    }

-    /// Removes the Component::TempStore from the alive list and
-    /// therefore marks the temp docstore file to be deleted by
-    /// the garbage collection.
-    pub fn untrack_temp_docstore(&self) {
-        self.tracked
-            .include_temp_doc_store
-            .store(false, std::sync::atomic::Ordering::Relaxed);
-    }
-
    /// Returns the number of deleted documents.
    pub fn num_deleted_docs(&self) -> u32 {
        self.tracked
@@ -111,20 +99,9 @@ impl SegmentMeta {
    /// is by removing all files that have been created by tantivy
    /// and are not used by any segment anymore.
    pub fn list_files(&self) -> HashSet<PathBuf> {
-        if self
-            .tracked
-            .include_temp_doc_store
-            .load(std::sync::atomic::Ordering::Relaxed)
-        {
-            SegmentComponent::iterator()
-                .map(|component| self.relative_path(*component))
-                .collect::<HashSet<PathBuf>>()
-        } else {
-            SegmentComponent::iterator()
-                .filter(|comp| *comp != &SegmentComponent::TempStore)
-                .map(|component| self.relative_path(*component))
-                .collect::<HashSet<PathBuf>>()
-        }
+        SegmentComponent::iterator()
+            .map(|component| self.relative_path(*component))
+            .collect::<HashSet<PathBuf>>()
    }

    /// Returns the relative path of a component of our segment.
@@ -138,7 +115,6 @@ impl SegmentMeta {
            SegmentComponent::Positions => ".pos".to_string(),
            SegmentComponent::Terms => ".term".to_string(),
            SegmentComponent::Store => ".store".to_string(),
-            SegmentComponent::TempStore => ".store.temp".to_string(),
            SegmentComponent::FastFields => ".fast".to_string(),
            SegmentComponent::FieldNorms => ".fieldnorm".to_string(),
            SegmentComponent::Delete => format!(".{}.del", self.delete_opstamp().unwrap_or(0)),
@@ -183,7 +159,6 @@ impl SegmentMeta {
            segment_id: inner_meta.segment_id,
            max_doc,
            deletes: None,
-            include_temp_doc_store: Arc::new(AtomicBool::new(true)),
        });
        SegmentMeta { tracked }
    }
@@ -202,7 +177,6 @@ impl SegmentMeta {
        let tracked = self.tracked.map(move |inner_meta| InnerSegmentMeta {
            segment_id: inner_meta.segment_id,
            max_doc: inner_meta.max_doc,
-            include_temp_doc_store: Arc::new(AtomicBool::new(true)),
            deletes: Some(delete_meta),
        });
        SegmentMeta { tracked }
@@ -214,14 +188,6 @@ struct InnerSegmentMeta {
    segment_id: SegmentId,
    max_doc: u32,
    pub deletes: Option<DeleteMeta>,
-    /// If you want to avoid the SegmentComponent::TempStore file to be covered by
-    /// garbage collection and deleted, set this to true. This is used during merge.
-    #[serde(skip)]
-    #[serde(default = "default_temp_store")]
-    pub(crate) include_temp_doc_store: Arc<AtomicBool>,
-}
-fn default_temp_store() -> Arc<AtomicBool> {
-    Arc::new(AtomicBool::new(false))
 }

 impl InnerSegmentMeta {
--- a/src/index/segment_component.rs
+++ b/src/index/segment_component.rs
@@ -23,8 +23,6 @@ pub enum SegmentComponent {
    /// Accessing a document from the store is relatively slow, as it
    /// requires to decompress the entire block it belongs to.
    Store,
-    /// Temporary storage of the documents, before streamed to `Store`.
-    TempStore,
    /// Bitset describing which document of the segment is alive.
    /// (It was representing deleted docs but changed to represent alive docs from v0.17)
    Delete,
@@ -33,14 +31,13 @@ pub enum SegmentComponent {
 impl SegmentComponent {
    /// Iterates through the components.
    pub fn iterator() -> slice::Iter<'static, SegmentComponent> {
-        static SEGMENT_COMPONENTS: [SegmentComponent; 8] = [
+        static SEGMENT_COMPONENTS: [SegmentComponent; 7] = [
            SegmentComponent::Postings,
            SegmentComponent::Positions,
            SegmentComponent::FastFields,
            SegmentComponent::FieldNorms,
            SegmentComponent::Terms,
            SegmentComponent::Store,
-            SegmentComponent::TempStore,
            SegmentComponent::Delete,
        ];
        SEGMENT_COMPONENTS.iter()
--- a/src/indexer/index_writer.rs
+++ b/src/indexer/index_writer.rs
@@ -218,7 +218,7 @@ fn index_documents<D: Document>(
    let alive_bitset_opt = apply_deletes(&segment_with_max_doc, &mut delete_cursor, &doc_opstamps)?;

    let meta = segment_with_max_doc.meta().clone();
-    meta.untrack_temp_docstore();
+
    // update segment_updater inventory to remove tempstore
    let segment_entry = SegmentEntry::new(meta, delete_cursor, alive_bitset_opt);
    segment_updater.schedule_add_segment(segment_entry).wait()?;
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -377,7 +377,7 @@ pub mod tests {

    use common::{BinarySerializable, FixedSize};
    use query_grammar::{UserInputAst, UserInputLeaf, UserInputLiteral};
-    use rand::distributions::{Bernoulli, Uniform};
+    use rand::distr::{Bernoulli, Uniform};
    use rand::rngs::StdRng;
    use rand::{Rng, SeedableRng};
    use time::OffsetDateTime;
@@ -428,7 +428,7 @@ pub mod tests {
    pub fn generate_nonunique_unsorted(max_value: u32, n_elems: usize) -> Vec<u32> {
        let seed: [u8; 32] = [1; 32];
        StdRng::from_seed(seed)
-            .sample_iter(&Uniform::new(0u32, max_value))
+            .sample_iter(&Uniform::new(0u32, max_value).unwrap())
            .take(n_elems)
            .collect::<Vec<u32>>()
    }
--- a/src/postings/block_segment_postings.rs
+++ b/src/postings/block_segment_postings.rs
@@ -303,10 +303,10 @@ impl BlockSegmentPostings {
    }

    pub(crate) fn load_block(&mut self) {
-        let offset = self.skip_reader.byte_offset();
        if self.block_is_loaded() {
            return;
        }
+        let offset = self.skip_reader.byte_offset();
        match self.skip_reader.block_info() {
            BlockInfo::BitPacked {
                doc_num_bits,
--- a/src/postings/compression/mod.rs
+++ b/src/postings/compression/mod.rs
@@ -397,7 +397,10 @@ mod bench {
        let mut seed: [u8; 32] = [0; 32];
        seed[31] = seed_val;
        let mut rng = StdRng::from_seed(seed);
-        (0u32..).filter(|_| rng.gen_bool(ratio)).take(n).collect()
+        (0u32..)
+            .filter(|_| rng.random_bool(ratio))
+            .take(n)
+            .collect()
    }

    pub fn generate_array(n: usize, ratio: f64) -> Vec<u32> {
--- a/src/postings/mod.rs
+++ b/src/postings/mod.rs
@@ -604,13 +604,13 @@ mod bench {
            let mut index_writer: IndexWriter = index.writer_for_tests().unwrap();
            for _ in 0..posting_list_size {
                let mut doc = TantivyDocument::default();
-                if rng.gen_bool(1f64 / 15f64) {
+                if rng.random_bool(1f64 / 15f64) {
                    doc.add_text(text_field, "a");
                }
-                if rng.gen_bool(1f64 / 10f64) {
+                if rng.random_bool(1f64 / 10f64) {
                    doc.add_text(text_field, "b");
                }
-                if rng.gen_bool(1f64 / 5f64) {
+                if rng.random_bool(1f64 / 5f64) {
                    doc.add_text(text_field, "c");
                }
                doc.add_text(text_field, "d");
--- a/src/postings/segment_postings.rs
+++ b/src/postings/segment_postings.rs
@@ -70,13 +70,13 @@ impl SegmentPostings {
        let mut buffer = Vec::new();
        {
            let mut postings_serializer =
-                PostingsSerializer::new(&mut buffer, 0.0, IndexRecordOption::Basic, None);
+                PostingsSerializer::new(0.0, IndexRecordOption::Basic, None);
            postings_serializer.new_term(docs.len() as u32, false);
            for &doc in docs {
                postings_serializer.write_doc(doc, 1u32);
            }
            postings_serializer
-                .close_term(docs.len() as u32)
+                .close_term(docs.len() as u32, &mut buffer)
                .expect("In memory Serialization should never fail.");
        }
        let block_segment_postings = BlockSegmentPostings::open(
@@ -115,7 +115,6 @@ impl SegmentPostings {
            })
            .unwrap_or(0.0);
        let mut postings_serializer = PostingsSerializer::new(
-            &mut buffer,
            average_field_norm,
            IndexRecordOption::WithFreqs,
            fieldnorm_reader,
@@ -125,7 +124,7 @@ impl SegmentPostings {
            postings_serializer.write_doc(doc, tf);
        }
        postings_serializer
-            .close_term(doc_and_tfs.len() as u32)
+            .close_term(doc_and_tfs.len() as u32, &mut buffer)
            .unwrap();
        let block_segment_postings = BlockSegmentPostings::open(
            doc_and_tfs.len() as u32,
@@ -169,12 +168,20 @@ impl DocSet for SegmentPostings {
        self.doc()
    }

+    #[inline]
    fn seek(&mut self, target: DocId) -> DocId {
        debug_assert!(self.doc() <= target);
        if self.doc() >= target {
            return self.doc();
        }

+        // As an optimization, if the block is already loaded, we can
+        // cheaply check the next doc.
+        self.cur = (self.cur + 1).min(COMPRESSION_BLOCK_SIZE - 1);
+        if self.doc() >= target {
+            return self.doc();
+        }
+
        // Delegate block-local search to BlockSegmentPostings::seek, which returns
        // the in-block index of the first doc >= target.
        self.cur = self.block_cursor.seek(target);
--- a/src/postings/serializer.rs
+++ b/src/postings/serializer.rs
@@ -104,10 +104,12 @@ impl InvertedIndexSerializer {
 /// the serialization of a specific field.
 pub struct FieldSerializer<'a> {
    term_dictionary_builder: TermDictionaryBuilder<&'a mut CountingWriter<WritePtr>>,
-    postings_serializer: PostingsSerializer<&'a mut CountingWriter<WritePtr>>,
+    postings_serializer: PostingsSerializer,
    positions_serializer_opt: Option<PositionSerializer<&'a mut CountingWriter<WritePtr>>>,
    current_term_info: TermInfo,
    term_open: bool,
+    postings_write: &'a mut CountingWriter<WritePtr>,
+    postings_start_offset: u64,
 }

 impl<'a> FieldSerializer<'a> {
@@ -128,27 +130,30 @@ impl<'a> FieldSerializer<'a> {
            .as_ref()
            .map(|ff_reader| total_num_tokens as Score / ff_reader.num_docs() as Score)
            .unwrap_or(0.0);
-        let postings_serializer = PostingsSerializer::new(
-            postings_write,
-            average_fieldnorm,
-            index_record_option,
-            fieldnorm_reader,
-        );
+        let postings_serializer =
+            PostingsSerializer::new(average_fieldnorm, index_record_option, fieldnorm_reader);
        let positions_serializer_opt = if index_record_option.has_positions() {
            Some(PositionSerializer::new(positions_write))
        } else {
            None
        };

+        let postings_start_offset = postings_write.written_bytes();
        Ok(FieldSerializer {
            term_dictionary_builder,
            postings_serializer,
            positions_serializer_opt,
            current_term_info: TermInfo::default(),
            term_open: false,
+            postings_write,
+            postings_start_offset,
        })
    }

+    fn postings_offset(&self) -> usize {
+        (self.postings_write.written_bytes() - self.postings_start_offset) as usize
+    }
+
    fn current_term_info(&self) -> TermInfo {
        let positions_start =
            if let Some(positions_serializer) = self.positions_serializer_opt.as_ref() {
@@ -156,7 +161,7 @@ impl<'a> FieldSerializer<'a> {
            } else {
                0u64
            } as usize;
-        let addr = self.postings_serializer.written_bytes() as usize;
+        let addr = self.postings_offset();
        TermInfo {
            doc_freq: 0,
            postings_range: addr..addr,
@@ -213,21 +218,22 @@ impl<'a> FieldSerializer<'a> {
        crate::fail_point!("FieldSerializer::close_term", |msg: Option<String>| {
            Err(io::Error::new(io::ErrorKind::Other, format!("{msg:?}")))
        });
-        if self.term_open {
-            self.postings_serializer
-                .close_term(self.current_term_info.doc_freq)?;
-            self.current_term_info.postings_range.end =
-                self.postings_serializer.written_bytes() as usize;

-            if let Some(positions_serializer) = self.positions_serializer_opt.as_mut() {
-                positions_serializer.close_term()?;
-                self.current_term_info.positions_range.end =
-                    positions_serializer.written_bytes() as usize;
-            }
-            self.term_dictionary_builder
-                .insert_value(&self.current_term_info)?;
-            self.term_open = false;
+        if !self.term_open {
+            return Ok(());
+        };
+
+        self.postings_serializer
+            .close_term(self.current_term_info.doc_freq, self.postings_write)?;
+        self.current_term_info.postings_range.end = self.postings_offset();
+        if let Some(positions_serializer) = self.positions_serializer_opt.as_mut() {
+            positions_serializer.close_term()?;
+            self.current_term_info.positions_range.end =
+                positions_serializer.written_bytes() as usize;
        }
+        self.term_dictionary_builder
+            .insert_value(&self.current_term_info)?;
+        self.term_open = false;
        Ok(())
    }

@@ -237,7 +243,7 @@ impl<'a> FieldSerializer<'a> {
        if let Some(positions_serializer) = self.positions_serializer_opt {
            positions_serializer.close()?;
        }
-        self.postings_serializer.close()?;
+        self.postings_write.flush()?;
        self.term_dictionary_builder.finish()?;
        Ok(())
    }
@@ -291,8 +297,7 @@ impl Block {
    }
 }

-pub struct PostingsSerializer<W: Write> {
-    output_write: CountingWriter<W>,
+pub struct PostingsSerializer {
    last_doc_id_encoded: u32,

    block_encoder: BlockEncoder,
@@ -310,16 +315,13 @@ pub struct PostingsSerializer<W: Write> {
    term_has_freq: bool,
 }

-impl<W: Write> PostingsSerializer<W> {
+impl PostingsSerializer {
    pub fn new(
-        write: W,
        avg_fieldnorm: Score,
        mode: IndexRecordOption,
        fieldnorm_reader: Option<FieldNormReader>,
-    ) -> PostingsSerializer<W> {
+    ) -> PostingsSerializer {
        PostingsSerializer {
-            output_write: CountingWriter::wrap(write),
-
            block_encoder: BlockEncoder::new(),
            block: Box::new(Block::new()),

@@ -422,11 +424,11 @@ impl<W: Write> PostingsSerializer<W> {
        }
    }

-    fn close(mut self) -> io::Result<()> {
-        self.postings_write.flush()
-    }
-
-    pub fn close_term(&mut self, doc_freq: u32) -> io::Result<()> {
+    pub fn close_term(
+        &mut self,
+        doc_freq: u32,
+        output_write: &mut impl std::io::Write,
+    ) -> io::Result<()> {
        if !self.block.is_empty() {
            // we have doc ids waiting to be written
            // this happens when the number of doc ids is
@@ -451,26 +453,16 @@ impl<W: Write> PostingsSerializer<W> {
        }
        if doc_freq >= COMPRESSION_BLOCK_SIZE as u32 {
            let skip_data = self.skip_write.data();
-            VInt(skip_data.len() as u64).serialize(&mut self.output_write)?;
-            self.output_write.write_all(skip_data)?;
+            VInt(skip_data.len() as u64).serialize(output_write)?;
+            output_write.write_all(skip_data)?;
        }
-        self.output_write.write_all(&self.postings_write[..])?;
+        output_write.write_all(&self.postings_write[..])?;
        self.skip_write.clear();
        self.postings_write.clear();
        self.bm25_weight = None;
        Ok(())
    }

-    /// Returns the number of bytes written in the postings write object
-    /// at this point.
-    /// When called before writing the postings of a term, this value is used as
-    /// start offset.
-    /// When called after writing the postings of a term, this value is used as a
-    /// end offset.
-    fn written_bytes(&self) -> u64 {
-        self.output_write.written_bytes()
-    }
-
    fn clear(&mut self) {
        self.block.clear();
        self.last_doc_id_encoded = 0;
--- a/src/query/boolean_query/boolean_weight.rs
+++ b/src/query/boolean_query/boolean_weight.rs
@@ -291,18 +291,6 @@ impl<TScoreCombiner: ScoreCombiner> BooleanWeight<TScoreCombiner> {
            }
        };

-        let exclude_scorer_opt: Option<Box<dyn Scorer>> = if exclude_scorers.is_empty() {
-            None
-        } else {
-            let exclude_specialized_scorer: SpecializedScorer =
-                scorer_union(exclude_scorers, DoNothingCombiner::default, num_docs);
-            Some(into_box_scorer(
-                exclude_specialized_scorer,
-                DoNothingCombiner::default,
-                num_docs,
-            ))
-        };
-
        let include_scorer = match (should_scorers, must_scorers) {
            (ShouldScorersCombinationMethod::Ignored, must_scorers) => {
                // No SHOULD clauses (or they were absorbed into MUST).
@@ -380,16 +368,23 @@ impl<TScoreCombiner: ScoreCombiner> BooleanWeight<TScoreCombiner> {
                }
            }
        };
-        if let Some(exclude_scorer) = exclude_scorer_opt {
-            let include_scorer_boxed =
-                into_box_scorer(include_scorer, &score_combiner_fn, num_docs);
-            Ok(SpecializedScorer::Other(Box::new(Exclude::new(
-                include_scorer_boxed,
-                exclude_scorer,
-            ))))
-        } else {
-            Ok(include_scorer)
+        if exclude_scorers.is_empty() {
+            return Ok(include_scorer);
        }
+
+        let include_scorer_boxed = into_box_scorer(include_scorer, &score_combiner_fn, num_docs);
+        let scorer: Box<dyn Scorer> = if exclude_scorers.len() == 1 {
+            let exclude_scorer = exclude_scorers.pop().unwrap();
+            match exclude_scorer.downcast::<TermScorer>() {
+                // Cast to TermScorer succeeded
+                Ok(exclude_scorer) => Box::new(Exclude::new(include_scorer_boxed, *exclude_scorer)),
+                // We get back the original Box<dyn Scorer>
+                Err(exclude_scorer) => Box::new(Exclude::new(include_scorer_boxed, exclude_scorer)),
+            }
+        } else {
+            Box::new(Exclude::new(include_scorer_boxed, exclude_scorers))
+        };
+        Ok(SpecializedScorer::Other(scorer))
    }
 }

--- a/src/query/boost_query.rs
+++ b/src/query/boost_query.rs
@@ -1,6 +1,6 @@
 use std::fmt;

-use crate::docset::COLLECT_BLOCK_BUFFER_LEN;
+use crate::docset::{SeekDangerResult, COLLECT_BLOCK_BUFFER_LEN};
 use crate::fastfield::AliveBitSet;
 use crate::query::{EnableScoring, Explanation, Query, Scorer, Weight};
 use crate::{DocId, DocSet, Score, SegmentReader, Term};
@@ -104,8 +104,8 @@ impl<S: Scorer> DocSet for BoostScorer<S> {
    fn seek(&mut self, target: DocId) -> DocId {
        self.underlying.seek(target)
    }
-    fn seek_into_the_danger_zone(&mut self, target: DocId) -> bool {
-        self.underlying.seek_into_the_danger_zone(target)
+    fn seek_danger(&mut self, target: DocId) -> SeekDangerResult {
+        self.underlying.seek_danger(target)
    }

    fn fill_buffer(&mut self, buffer: &mut [DocId; COLLECT_BLOCK_BUFFER_LEN]) -> usize {
--- a/src/query/disjunction.rs
+++ b/src/query/disjunction.rs
@@ -1,6 +1,7 @@
 use std::cmp::Ordering;
 use std::collections::BinaryHeap;

+use crate::docset::SeekDangerResult;
 use crate::query::score_combiner::DoNothingCombiner;
 use crate::query::{ScoreCombiner, Scorer};
 use crate::{DocId, DocSet, Score, TERMINATED};
@@ -67,10 +68,12 @@ impl<T: Scorer> DocSet for ScorerWrapper<T> {
        self.current_doc = doc_id;
        doc_id
    }
-    fn seek_into_the_danger_zone(&mut self, target: DocId) -> bool {
-        let found = self.scorer.seek_into_the_danger_zone(target);
-        self.current_doc = self.scorer.doc();
-        found
+    fn seek_danger(&mut self, target: DocId) -> SeekDangerResult {
+        let result = self.scorer.seek_danger(target);
+        if result == SeekDangerResult::Found {
+            self.current_doc = target;
+        }
+        result
    }

    fn doc(&self) -> DocId {
--- a/src/query/exclude.rs
+++ b/src/query/exclude.rs
@@ -1,48 +1,71 @@
-use crate::docset::{DocSet, TERMINATED};
+use crate::docset::{DocSet, SeekDangerResult, TERMINATED};
 use crate::query::Scorer;
 use crate::{DocId, Score};

-#[inline]
-fn is_within<TDocSetExclude: DocSet>(docset: &mut TDocSetExclude, doc: DocId) -> bool {
-    docset.doc() <= doc && docset.seek(doc) == doc
-}
-
-/// Filters a given `DocSet` by removing the docs from a given `DocSet`.
+/// An exclusion set is a set of documents
+/// that should be excluded from a given DocSet.
 ///
-/// The excluding docset has no impact on scoring.
-pub struct Exclude<TDocSet, TDocSetExclude> {
-    underlying_docset: TDocSet,
-    excluding_docset: TDocSetExclude,
+/// It can be a single DocSet, or a Vec of DocSets.
+pub trait ExclusionSet: Send {
+    /// Returns `true` if the given `doc` is in the exclusion set.
+    fn contains(&mut self, doc: DocId) -> bool;
 }

-impl<TDocSet, TDocSetExclude> Exclude<TDocSet, TDocSetExclude>
+impl<TDocSet: DocSet> ExclusionSet for TDocSet {
+    #[inline]
+    fn contains(&mut self, doc: DocId) -> bool {
+        self.seek_danger(doc) == SeekDangerResult::Found
+    }
+}
+
+impl<TDocSet: DocSet> ExclusionSet for Vec<TDocSet> {
+    #[inline]
+    fn contains(&mut self, doc: DocId) -> bool {
+        for docset in self.iter_mut() {
+            if docset.seek_danger(doc) == SeekDangerResult::Found {
+                return true;
+            }
+        }
+        false
+    }
+}
+
+/// Filters a given `DocSet` by removing the docs from an exclusion set.
+///
+/// The excluding docsets have no impact on scoring.
+pub struct Exclude<TDocSet, TExclusionSet> {
+    underlying_docset: TDocSet,
+    exclusion_set: TExclusionSet,
+}
+
+impl<TDocSet, TExclusionSet> Exclude<TDocSet, TExclusionSet>
 where
    TDocSet: DocSet,
-    TDocSetExclude: DocSet,
+    TExclusionSet: ExclusionSet,
 {
    /// Creates a new `ExcludeScorer`
    pub fn new(
        mut underlying_docset: TDocSet,
-        mut excluding_docset: TDocSetExclude,
-    ) -> Exclude<TDocSet, TDocSetExclude> {
+        mut exclusion_set: TExclusionSet,
+    ) -> Exclude<TDocSet, TExclusionSet> {
        while underlying_docset.doc() != TERMINATED {
            let target = underlying_docset.doc();
-            if !is_within(&mut excluding_docset, target) {
+            if !exclusion_set.contains(target) {
                break;
            }
            underlying_docset.advance();
        }
        Exclude {
            underlying_docset,
-            excluding_docset,
+            exclusion_set,
        }
    }
 }

-impl<TDocSet, TDocSetExclude> DocSet for Exclude<TDocSet, TDocSetExclude>
+impl<TDocSet, TExclusionSet> DocSet for Exclude<TDocSet, TExclusionSet>
 where
    TDocSet: DocSet,
-    TDocSetExclude: DocSet,
+    TExclusionSet: ExclusionSet,
 {
    fn advance(&mut self) -> DocId {
        loop {
@@ -50,7 +73,7 @@ where
            if candidate == TERMINATED {
                return TERMINATED;
            }
-            if !is_within(&mut self.excluding_docset, candidate) {
+            if !self.exclusion_set.contains(candidate) {
                return candidate;
            }
        }
@@ -61,7 +84,7 @@ where
        if candidate == TERMINATED {
            return TERMINATED;
        }
-        if !is_within(&mut self.excluding_docset, candidate) {
+        if !self.exclusion_set.contains(candidate) {
            return candidate;
        }
        self.advance()
@@ -79,10 +102,10 @@ where
    }
 }

-impl<TScorer, TDocSetExclude> Scorer for Exclude<TScorer, TDocSetExclude>
+impl<TScorer, TExclusionSet> Scorer for Exclude<TScorer, TExclusionSet>
 where
    TScorer: Scorer,
-    TDocSetExclude: DocSet + 'static,
+    TExclusionSet: ExclusionSet + 'static,
 {
    #[inline]
    fn score(&mut self) -> Score {
--- a/src/query/intersection.rs
+++ b/src/query/intersection.rs
@@ -1,5 +1,5 @@
 use super::size_hint::estimate_intersection;
-use crate::docset::{DocSet, TERMINATED};
+use crate::docset::{DocSet, SeekDangerResult, TERMINATED};
 use crate::query::term_query::TermScorer;
 use crate::query::{EmptyScorer, Scorer};
 use crate::{DocId, Score};
@@ -84,6 +84,14 @@ impl<TDocSet: DocSet> Intersection<TDocSet, TDocSet> {
        docsets.sort_by_key(|docset| docset.cost());
        go_to_first_doc(&mut docsets);
        let left = docsets.remove(0);
+        debug_assert!({
+            let doc = left.doc();
+            if doc == TERMINATED {
+                true
+            } else {
+                docsets.iter().all(|docset| docset.doc() == doc)
+            }
+        });
        let right = docsets.remove(0);
        Intersection {
            left,
@@ -108,46 +116,61 @@ impl<TDocSet: DocSet, TOtherDocSet: DocSet> DocSet for Intersection<TDocSet, TOt
    #[inline]
    fn advance(&mut self) -> DocId {
        let (left, right) = (&mut self.left, &mut self.right);
-        let mut candidate = left.advance();
-        if candidate == TERMINATED {
-            return TERMINATED;
-        }

-        loop {
-            // In the first part we look for a document in the intersection
-            // of the two rarest `DocSet` in the intersection.
+        // Invariant:
+        // - candidate is always <= the next document in the intersection.
+        // - candidate strictly increases at every iteration of the loop.
+        let mut candidate = left.doc() + 1;

-            loop {
-                if right.seek_into_the_danger_zone(candidate) {
-                    break;
-                }
-                let right_doc = right.doc();
-                // TODO: Think about which value would make sense here
-                // It depends on the DocSet implementation, when a seek would outweigh an advance.
-                if right_doc > candidate.wrapping_add(100) {
-                    candidate = left.seek(right_doc);
-                } else {
-                    candidate = left.advance();
-                }
-                if candidate == TERMINATED {
-                    return TERMINATED;
-                }
-            }
+        // Termination: candidate strictly increases.
+        'outer: while candidate < TERMINATED {
+            // As we enter the loop, we should always have candidate <= next_doc.

-            debug_assert_eq!(left.doc(), right.doc());
-            // test the remaining scorers
-            if self
-                .others
-                .iter_mut()
-                .all(|docset| docset.seek_into_the_danger_zone(candidate))
+            candidate = left.seek(candidate);
+
+            // Left is positioned on `candidate`.
+            debug_assert_eq!(left.doc(), candidate);
+
+            if let SeekDangerResult::SeekLowerBound(seek_lower_bound) = right.seek_danger(candidate)
            {
-                debug_assert_eq!(candidate, self.left.doc());
-                debug_assert_eq!(candidate, self.right.doc());
-                debug_assert!(self.others.iter().all(|docset| docset.doc() == candidate));
-                return candidate;
+                debug_assert!(
+                    seek_lower_bound == TERMINATED || seek_lower_bound > candidate,
+                    "seek_lower_bound {seek_lower_bound} must be greater than candidate \
+                     {candidate}"
+                );
+                candidate = seek_lower_bound;
+                continue;
            }
-            candidate = left.advance();
+
+            // Left and right are positioned on `candidate`.
+            debug_assert_eq!(right.doc(), candidate);
+
+            for other in &mut self.others {
+                if let SeekDangerResult::SeekLowerBound(seek_lower_bound) =
+                    other.seek_danger(candidate)
+                {
+                    // One of the scorers does not match; restart at the top of the loop.
+                    debug_assert!(
+                        seek_lower_bound == TERMINATED || seek_lower_bound > candidate,
+                        "seek_lower_bound {seek_lower_bound} must be greater than candidate \
+                         {candidate}"
+                    );
+                    candidate = seek_lower_bound;
+                    continue 'outer;
+                }
+            }
+
+            // At this point all scorers are in a valid state, aligned on the next document in the
+            // intersection.
+            debug_assert!(self.others.iter().all(|docset| docset.doc() == candidate));
+            return candidate;
        }
+
+        // We make sure our docset is in a valid state.
+        // In particular, we want .doc() to return TERMINATED.
+        left.seek(TERMINATED);
+
+        TERMINATED
    }

    fn seek(&mut self, target: DocId) -> DocId {
@@ -166,13 +189,19 @@ impl<TDocSet: DocSet, TOtherDocSet: DocSet> DocSet for Intersection<TDocSet, TOt
    ///
    /// Some implementations may choose to advance past the target if beneficial for performance.
    /// The return value is `true` if the target is in the docset, and `false` otherwise.
-    fn seek_into_the_danger_zone(&mut self, target: DocId) -> bool {
-        self.left.seek_into_the_danger_zone(target)
-            && self.right.seek_into_the_danger_zone(target)
-            && self
-                .others
-                .iter_mut()
-                .all(|docset| docset.seek_into_the_danger_zone(target))
+    fn seek_danger(&mut self, target: DocId) -> SeekDangerResult {
+        if let SeekDangerResult::SeekLowerBound(new_target) = self.left.seek_danger(target) {
+            return SeekDangerResult::SeekLowerBound(new_target);
+        }
+        if let SeekDangerResult::SeekLowerBound(new_target) = self.right.seek_danger(target) {
+            return SeekDangerResult::SeekLowerBound(new_target);
+        }
+        for docset in &mut self.others {
+            if let SeekDangerResult::SeekLowerBound(new_target) = docset.seek_danger(target) {
+                return SeekDangerResult::SeekLowerBound(new_target);
+            }
+        }
+        SeekDangerResult::Found
    }

    #[inline]
@@ -215,9 +244,12 @@ mod tests {
    use proptest::prelude::*;

    use super::Intersection;
+    use crate::collector::Count;
    use crate::docset::{DocSet, TERMINATED};
    use crate::postings::tests::test_skip_against_unoptimized;
-    use crate::query::VecDocSet;
+    use crate::query::{QueryParser, VecDocSet};
+    use crate::schema::{Schema, TEXT};
+    use crate::Index;

    #[test]
    fn test_intersection() {
@@ -304,6 +336,58 @@ mod tests {
        assert_eq!(intersection.doc(), TERMINATED);
    }

+    #[test]
+    fn test_intersection_abc() {
+        let a = VecDocSet::from(vec![2, 3, 6]);
+        let b = VecDocSet::from(vec![1, 3, 5]);
+        let c = VecDocSet::from(vec![1, 3, 5]);
+        let mut intersection = Intersection::new(vec![c, b, a], 10);
+        let mut docs = Vec::new();
+        use crate::DocSet;
+        while intersection.doc() != TERMINATED {
+            docs.push(intersection.doc());
+            intersection.advance();
+        }
+        assert_eq!(&docs, &[3]);
+    }
+
+    #[test]
+    fn test_intersection_termination() {
+        use crate::query::score_combiner::DoNothingCombiner;
+        use crate::query::{BufferedUnionScorer, ConstScorer, VecDocSet};
+
+        let a1 = ConstScorer::new(VecDocSet::from(vec![0u32, 10000]), 1.0);
+        let a2 = ConstScorer::new(VecDocSet::from(vec![0u32, 10000]), 1.0);
+
+        let mut b_scorers = vec![];
+        for _ in 0..2 {
+            // Union matches 0 and 10000.
+            b_scorers.push(ConstScorer::new(VecDocSet::from(vec![0, 10000]), 1.0));
+        }
+        // That's the union of two scorers, each matching 0 and 10_000.
+        let union = BufferedUnionScorer::build(b_scorers, DoNothingCombiner::default, 30000);
+
+        // Mismatching scorer: matches 0 and 20000. We then append more docs at the end to ensure it
+        // is last.
+        let mut m_docs = vec![0, 20000];
+        for i in 30000..30100 {
+            m_docs.push(i);
+        }
+        let m = ConstScorer::new(VecDocSet::from(m_docs), 1.0);
+
+        // Costs: A1=2, A2=2, Union=4, M=102.
+        // Sorted: A1, A2, Union, M.
+        // Left=A1, Right=A2, Others=[Union, M].
+        let mut intersection = crate::query::intersect_scorers(
+            vec![Box::new(a1), Box::new(a2), Box::new(union), Box::new(m)],
+            40000,
+        );
+
+        while intersection.doc() != TERMINATED {
+            intersection.advance();
+        }
+    }
+
    // Strategy to generate sorted and deduplicated vectors of u32 document IDs
    fn sorted_deduped_vec(max_val: u32, max_size: usize) -> impl Strategy<Value = Vec<u32>> {
        prop::collection::vec(0..max_val, 0..max_size).prop_map(|mut vec| {
@@ -335,6 +419,30 @@ mod tests {
            }
            assert_eq!(intersection.doc(), TERMINATED);
        }
+    }

+    #[test]
+    fn test_bug_2811_intersection_candidate_should_increase() {
+        let mut schema_builder = Schema::builder();
+        let text_field = schema_builder.add_text_field("text", TEXT);
+        let schema = schema_builder.build();
+
+        let index = Index::create_in_ram(schema);
+        let mut writer = index.writer_for_tests().unwrap();
+        writer
+            .add_document(doc!(text_field=>"hello happy tax"))
+            .unwrap();
+        writer.add_document(doc!(text_field=>"hello")).unwrap();
+        writer.add_document(doc!(text_field=>"hello")).unwrap();
+        writer.add_document(doc!(text_field=>"happy tax")).unwrap();
+
+        writer.commit().unwrap();
+        let query_parser = QueryParser::for_index(&index, Vec::new());
+        let query = query_parser
+            .parse_query(r#"+text:hello +text:"happy tax""#)
+            .unwrap();
+        let searcher = index.reader().unwrap().searcher();
+        let c = searcher.search(&*query, &Count).unwrap();
+        assert_eq!(c, 1);
    }
 }
--- a/src/query/mod.rs
+++ b/src/query/mod.rs
@@ -43,7 +43,7 @@ pub use self::boost_query::{BoostQuery, BoostWeight};
 pub use self::const_score_query::{ConstScoreQuery, ConstScorer};
 pub use self::disjunction_max_query::DisjunctionMaxQuery;
 pub use self::empty_query::{EmptyQuery, EmptyScorer, EmptyWeight};
-pub use self::exclude::Exclude;
+pub use self::exclude::{Exclude, ExclusionSet};
 pub use self::exist_query::ExistsQuery;
 pub use self::explanation::Explanation;
 #[cfg(test)]
--- a/src/query/phrase_prefix_query/phrase_prefix_scorer.rs
+++ b/src/query/phrase_prefix_query/phrase_prefix_scorer.rs
@@ -1,4 +1,4 @@
-use crate::docset::{DocSet, TERMINATED};
+use crate::docset::{DocSet, SeekDangerResult, TERMINATED};
 use crate::fieldnorm::FieldNormReader;
 use crate::postings::Postings;
 use crate::query::bm25::Bm25Weight;
@@ -194,11 +194,16 @@ impl<TPostings: Postings> DocSet for PhrasePrefixScorer<TPostings> {
        self.advance()
    }

-    fn seek_into_the_danger_zone(&mut self, target: DocId) -> bool {
-        if self.phrase_scorer.seek_into_the_danger_zone(target) {
-            self.matches_prefix()
+    fn seek_danger(&mut self, target: DocId) -> SeekDangerResult {
+        let seek_res = self.phrase_scorer.seek_danger(target);
+        if seek_res != SeekDangerResult::Found {
+            return seek_res;
+        }
+        // The intersection matched. Now let's see if we match the prefix.
+        if self.matches_prefix() {
+            SeekDangerResult::Found
        } else {
-            false
+            SeekDangerResult::SeekLowerBound(target + 1)
        }
    }

--- a/src/query/phrase_query/phrase_scorer.rs
+++ b/src/query/phrase_query/phrase_scorer.rs
@@ -1,6 +1,6 @@
 use std::cmp::Ordering;

-use crate::docset::{DocSet, TERMINATED};
+use crate::docset::{DocSet, SeekDangerResult, TERMINATED};
 use crate::fieldnorm::FieldNormReader;
 use crate::postings::Postings;
 use crate::query::bm25::Bm25Weight;
@@ -530,12 +530,23 @@ impl<TPostings: Postings> DocSet for PhraseScorer<TPostings> {
        self.advance()
    }

-    fn seek_into_the_danger_zone(&mut self, target: DocId) -> bool {
-        debug_assert!(target >= self.doc());
-        if self.intersection_docset.seek_into_the_danger_zone(target) && self.phrase_match() {
-            return true;
+    fn seek_danger(&mut self, target: DocId) -> SeekDangerResult {
+        debug_assert!(
+            target >= self.doc(),
+            "target ({}) should be greater than or equal to doc ({})",
+            target,
+            self.doc()
+        );
+        let seek_res = self.intersection_docset.seek_danger(target);
+        if seek_res != SeekDangerResult::Found {
+            return seek_res;
+        }
+        // The intersection matched. Now let's see if we match the phrase.
+        if self.phrase_match() {
+            SeekDangerResult::Found
+        } else {
+            SeekDangerResult::SeekLowerBound(target + 1)
        }
-        false
    }

    fn doc(&self) -> DocId {
--- a/src/query/phrase_query/regex_phrase_weight.rs
+++ b/src/query/phrase_query/regex_phrase_weight.rs
@@ -311,7 +311,7 @@ mod tests {
        #![proptest_config(ProptestConfig::with_cases(50))]
        #[test]
        fn test_phrase_regex_with_random_strings(mut random_strings in proptest::collection::vec("[c-z ]{0,10}", 1..100), num_occurrences in 1..150_usize) {
-            let mut rng = rand::thread_rng();
+            let mut rng = rand::rng();

            // Insert "aaa ccc" the specified number of times into the list
            for _ in 0..num_occurrences {
--- a/src/query/query_parser/query_parser.rs
+++ b/src/query/query_parser/query_parser.rs
@@ -2068,6 +2068,16 @@ mod test {
            format!("Regex(Field(0), {:#?})", expected_regex).as_str(),
            false,
        );
+        let expected_regex2 = tantivy_fst::Regex::new(r".*a").unwrap();
+        test_parse_query_to_logical_ast_helper(
+            "title:(/.*b/ OR /.*a/)",
+            format!(
+                "(Regex(Field(0), {:#?}) Regex(Field(0), {:#?}))",
+                expected_regex, expected_regex2
+            )
+            .as_str(),
+            false,
+        );

        // Invalid field
        let err = parse_query_to_logical_ast("float:/.*b/", false).unwrap_err();
--- a/src/query/range_query/mod.rs
+++ b/src/query/range_query/mod.rs
@@ -19,7 +19,8 @@ pub(crate) fn is_type_valid_for_fastfield_range_query(typ: Type) -> bool {
        | Type::Bool
        | Type::Date
        | Type::Json
-        | Type::IpAddr => true,
-        Type::Facet | Type::Bytes => false,
+        | Type::IpAddr
+        | Type::Bytes => true,
+        Type::Facet => false,
    }
 }
--- a/src/query/range_query/range_query.rs
+++ b/src/query/range_query/range_query.rs
@@ -429,7 +429,7 @@ mod tests {
                docs.push(doc);
            }

-            docs.shuffle(&mut rand::thread_rng());
+            docs.shuffle(&mut rand::rng());
            let mut docs_it = docs.into_iter();
            for doc in (&mut docs_it).take(50) {
                index_writer.add_document(doc)?;
--- a/src/query/range_query/range_query_fastfield.rs
+++ b/src/query/range_query/range_query_fastfield.rs
@@ -6,8 +6,8 @@ use std::net::Ipv6Addr;
 use std::ops::{Bound, RangeInclusive};

 use columnar::{
-    Cardinality, Column, ColumnType, MonotonicallyMappableToU128, MonotonicallyMappableToU64,
-    NumericalType, StrColumn,
+    BytesColumn, Cardinality, Column, ColumnType, MonotonicallyMappableToU128,
+    MonotonicallyMappableToU64, NumericalType, StrColumn,
 };
 use common::bounds::{BoundsRange, TransformBound};

@@ -163,6 +163,25 @@ impl Weight for FastFieldRangeWeight {
            };
            let dict = str_dict_column.dictionary();

+            let bounds = self.bounds.map_bound(get_value_bytes);
+            // Get term ids for terms
+            let (lower_bound, upper_bound) =
+                dict.term_bounds_to_ord(bounds.lower_bound, bounds.upper_bound)?;
+            let fast_field_reader = reader.fast_fields();
+            let Some((column, _col_type)) =
+                fast_field_reader.u64_lenient_for_type(None, &field_name)?
+            else {
+                return Ok(Box::new(EmptyScorer));
+            };
+            search_on_u64_ff(column, boost, BoundsRange::new(lower_bound, upper_bound))
+        } else if field_type.is_bytes() {
+            let Some(bytes_column): Option<BytesColumn> =
+                reader.fast_fields().bytes(&field_name)?
+            else {
+                return Ok(Box::new(EmptyScorer));
+            };
+            let dict = bytes_column.dictionary();
+
            let bounds = self.bounds.map_bound(get_value_bytes);
            // Get term ids for terms
            let (lower_bound, upper_bound) =
@@ -491,7 +510,7 @@ mod tests {
    use common::DateTime;
    use proptest::prelude::*;
    use rand::rngs::StdRng;
-    use rand::seq::SliceRandom;
+    use rand::seq::IndexedRandom;
    use rand::SeedableRng;
    use time::format_description::well_known::Rfc3339;
    use time::OffsetDateTime;
@@ -1402,6 +1421,66 @@ mod tests {

        Ok(())
    }
+
+    #[test]
+    fn test_bytes_field_ff_range_query() -> crate::Result<()> {
+        use crate::schema::BytesOptions;
+
+        let mut schema_builder = Schema::builder();
+        let bytes_field = schema_builder
+            .add_bytes_field("data", BytesOptions::default().set_fast().set_indexed());
+        let schema = schema_builder.build();
+        let index = Index::create_in_ram(schema.clone());
+        let mut index_writer: IndexWriter = index.writer_for_tests()?;
+
+        // Insert documents with lexicographically sortable byte values
+        // Using simple byte sequences that have clear ordering
+        let values: Vec<Vec<u8>> = vec![
+            vec![0x00, 0x10],
+            vec![0x00, 0x20],
+            vec![0x00, 0x30],
+            vec![0x01, 0x00],
+            vec![0x01, 0x10],
+            vec![0x02, 0x00],
+        ];
+
+        for value in &values {
+            let mut doc = TantivyDocument::new();
+            doc.add_bytes(bytes_field, value);
+            index_writer.add_document(doc)?;
+        }
+        index_writer.commit()?;
+
+        let reader = index.reader()?;
+        let searcher = reader.searcher();
+
+        // Test: Range query [0x00, 0x20] to [0x01, 0x00] (inclusive)
+        // Should match: [0x00, 0x20], [0x00, 0x30], [0x01, 0x00]
+        let lower = Term::from_field_bytes(bytes_field, &[0x00, 0x20]);
+        let upper = Term::from_field_bytes(bytes_field, &[0x01, 0x00]);
+        let range_query = RangeQuery::new(Bound::Included(lower), Bound::Included(upper));
+        let count = searcher.search(&range_query, &Count)?;
+        assert_eq!(
+            count, 3,
+            "Expected 3 documents in range [0x00,0x20] to [0x01,0x00]"
+        );
+
+        // Test: Range query > [0x01, 0x00] (exclusive lower bound)
+        // Should match: [0x01, 0x10], [0x02, 0x00]
+        let lower = Term::from_field_bytes(bytes_field, &[0x01, 0x00]);
+        let range_query = RangeQuery::new(Bound::Excluded(lower), Bound::Unbounded);
+        let count = searcher.search(&range_query, &Count)?;
+        assert_eq!(count, 2, "Expected 2 documents > [0x01,0x00]");
+
+        // Test: Range query < [0x00, 0x30] (exclusive upper bound)
+        // Should match: [0x00, 0x10], [0x00, 0x20]
+        let upper = Term::from_field_bytes(bytes_field, &[0x00, 0x30]);
+        let range_query = RangeQuery::new(Bound::Unbounded, Bound::Excluded(upper));
+        let count = searcher.search(&range_query, &Count)?;
+        assert_eq!(count, 2, "Expected 2 documents < [0x00,0x30]");
+
+        Ok(())
+    }
 }

 #[cfg(test)]
--- a/src/query/reqopt_scorer.rs
+++ b/src/query/reqopt_scorer.rs
@@ -1,6 +1,6 @@
 use std::marker::PhantomData;

-use crate::docset::DocSet;
+use crate::docset::{DocSet, SeekDangerResult};
 use crate::query::score_combiner::ScoreCombiner;
 use crate::query::Scorer;
 use crate::{DocId, Score};
@@ -56,9 +56,9 @@ where
        self.req_scorer.seek(target)
    }

-    fn seek_into_the_danger_zone(&mut self, target: DocId) -> bool {
+    fn seek_danger(&mut self, target: DocId) -> SeekDangerResult {
        self.score_cache = None;
-        self.req_scorer.seek_into_the_danger_zone(target)
+        self.req_scorer.seek_danger(target)
    }

    fn doc(&self) -> DocId {
--- a/src/query/term_query/term_scorer.rs
+++ b/src/query/term_query/term_scorer.rs
@@ -105,6 +105,7 @@ impl DocSet for TermScorer {

    #[inline]
    fn seek(&mut self, target: DocId) -> DocId {
+        debug_assert!(target >= self.doc());
        self.postings.seek(target)
    }

@@ -304,10 +305,10 @@ mod tests {
        let mut writer: IndexWriter =
            index.writer_with_num_threads(3, 3 * MEMORY_BUDGET_NUM_BYTES_MIN)?;
        use rand::Rng;
-        let mut rng = rand::thread_rng();
+        let mut rng = rand::rng();
        writer.set_merge_policy(Box::new(NoMergePolicy));
        for _ in 0..3_000 {
-            let term_freq = rng.gen_range(1..10000);
+            let term_freq = rng.random_range(1..10000);
            let words: Vec<&str> = std::iter::repeat_n("bbbb", term_freq).collect();
            let text = words.join(" ");
            writer.add_document(doc!(text_field=>text))?;
--- a/src/query/union/buffered_union.rs
+++ b/src/query/union/buffered_union.rs
@@ -1,6 +1,6 @@
 use common::TinySet;

-use crate::docset::{DocSet, TERMINATED};
+use crate::docset::{DocSet, SeekDangerResult, TERMINATED};
 use crate::query::score_combiner::{DoNothingCombiner, ScoreCombiner};
 use crate::query::size_hint::estimate_union;
 use crate::query::Scorer;
@@ -225,25 +225,47 @@ where
        }
    }

-    fn seek_into_the_danger_zone(&mut self, target: DocId) -> bool {
+    fn seek_danger(&mut self, target: DocId) -> SeekDangerResult {
+        if target >= TERMINATED {
+            return SeekDangerResult::SeekLowerBound(TERMINATED);
+        }
        if self.is_in_horizon(target) {
            // Our value is within the buffered horizon and the docset may already have been
            // processed and removed, so we need to use seek, which uses the regular advance.
-            self.seek(target) == target
-        } else {
-            // The docsets are not in the buffered range, so we can use seek_into_the_danger_zone
-            // of the underlying docsets
-            let is_hit = self
-                .docsets
-                .iter_mut()
-                .any(|docset| docset.seek_into_the_danger_zone(target));
+            let seek_doc = self.seek(target);
+            if seek_doc == target {
+                return SeekDangerResult::Found;
+            } else {
+                return SeekDangerResult::SeekLowerBound(seek_doc);
+            };
+        }

-            // The API requires the DocSet to be in a valid state when `seek_into_the_danger_zone`
-            // returns true.
-            if is_hit {
-                self.seek(target);
+        // The docsets are not in the buffered range, so we can use `seek_danger`
+        // of the underlying docsets.
+        let mut is_hit = false;
+        let mut min_new_target = TERMINATED;
+
+        for docset in self.docsets.iter_mut() {
+            match docset.seek_danger(target) {
+                SeekDangerResult::Found => {
+                    is_hit = true;
+                    break;
+                }
+                SeekDangerResult::SeekLowerBound(new_target) => {
+                    min_new_target = min_new_target.min(new_target);
+                }
            }
-            is_hit
+        }
+
+        // The API requires the DocSet to be in a valid state when `seek_danger`
+        // returns `Found`.
+        if is_hit {
+            // The doc is found. Let's make sure we position the union on the target
+            // to bring it back to a valid state.
+            self.seek(target);
+            SeekDangerResult::Found
+        } else {
+            SeekDangerResult::SeekLowerBound(min_new_target)
        }
    }

--- a/src/query/union/mod.rs
+++ b/src/query/union/mod.rs
@@ -14,7 +14,7 @@ mod tests {
    use common::BitSet;

    use super::{SimpleUnion, *};
-    use crate::docset::{DocSet, TERMINATED};
+    use crate::docset::{DocSet, SeekDangerResult, TERMINATED};
    use crate::postings::tests::test_skip_against_unoptimized;
    use crate::query::score_combiner::DoNothingCombiner;
    use crate::query::union::bitset_union::BitSetPostingUnion;
@@ -254,6 +254,27 @@ mod tests {
            vec![1, 2, 3, 7, 8, 9, 99, 100, 101, 500, 20000],
        );
    }
+
+    #[test]
+    fn test_buffered_union_seek_into_danger_zone_terminated() {
+        let scorer1 = ConstScorer::new(VecDocSet::from(vec![1, 2]), 1.0);
+        let scorer2 = ConstScorer::new(VecDocSet::from(vec![2, 3]), 1.0);
+
+        let mut union_scorer =
+            BufferedUnionScorer::build(vec![scorer1, scorer2], DoNothingCombiner::default, 100);
+
+        // Advance to end
+        while union_scorer.doc() != TERMINATED {
+            union_scorer.advance();
+        }
+
+        assert_eq!(union_scorer.doc(), TERMINATED);
+
+        assert_eq!(
+            union_scorer.seek_danger(TERMINATED),
+            SeekDangerResult::SeekLowerBound(TERMINATED)
+        );
+    }
 }

 #[cfg(all(test, feature = "unstable"))]
--- a/src/query/vec_docset.rs
+++ b/src/query/vec_docset.rs
@@ -17,6 +17,9 @@ pub struct VecDocSet {

 impl From<Vec<DocId>> for VecDocSet {
    fn from(doc_ids: Vec<DocId>) -> VecDocSet {
+        // We do not use `slice::is_sorted`, as we want to check for doc ids to be strictly
+        // sorted.
+        assert!(doc_ids.windows(2).all(|w| w[0] < w[1]));
        VecDocSet { doc_ids, cursor: 0 }
    }
 }
--- a/src/schema/field_type.rs
+++ b/src/schema/field_type.rs
@@ -223,6 +223,11 @@ impl FieldType {
        matches!(self, FieldType::Str(_))
    }

+    /// returns true if this is a bytes field
+    pub fn is_bytes(&self) -> bool {
+        matches!(self, FieldType::Bytes(_))
+    }
+
    /// returns true if this is an date field
    pub fn is_date(&self) -> bool {
        matches!(self, FieldType::Date(_))
--- a/src/space_usage/mod.rs
+++ b/src/space_usage/mod.rs
@@ -124,7 +124,6 @@ impl SegmentSpaceUsage {
            FieldNorms => PerField(self.fieldnorms().clone()),
            Terms => PerField(self.termdict().clone()),
            SegmentComponent::Store => ComponentSpaceUsage::Store(self.store().clone()),
-            SegmentComponent::TempStore => ComponentSpaceUsage::Store(self.store().clone()),
            Delete => Basic(self.deletes()),
        }
    }
--- a/src/termdict/fst_termdict/merger.rs
+++ b/src/termdict/fst_termdict/merger.rs
@@ -95,7 +95,7 @@ impl<'a> TermMerger<'a> {
 #[cfg(all(test, feature = "unstable"))]
 mod bench {
    use rand::distributions::Alphanumeric;
-    use rand::{thread_rng, Rng};
+    use rand::{rng, Rng};
    use test::{self, Bencher};

    use super::TermMerger;
@@ -117,9 +117,9 @@ mod bench {
        let buffer: Vec<u8> = {
            let mut terms = vec![];
            for _i in 0..num_terms {
-                let rand_string: String = thread_rng()
+                let rand_string: String = rng()
                    .sample_iter(&Alphanumeric)
-                    .take(thread_rng().gen_range(30..42))
+                    .take(rng().random_range(30..42))
                    .map(char::from)
                    .collect();
                terms.push(rand_string);
--- a/sstable/Cargo.toml
+++ b/sstable/Cargo.toml
@@ -25,7 +25,7 @@ zstd-compression = ["zstd"]
 proptest = "1"
 criterion = { version = "0.5", default-features = false }
 names = "0.14"
-rand = "0.8"
+rand = "0.9"

 [[bench]]
 name = "stream_bench"
--- a/sstable/benches/stream_bench.rs
+++ b/sstable/benches/stream_bench.rs
@@ -10,9 +10,9 @@ use tantivy_sstable::{Dictionary, MonotonicU64SSTable};
 const CHARSET: &[u8] = b"abcdefghij";

 fn generate_key(rng: &mut impl Rng) -> String {
-    let len = rng.gen_range(3..12);
+    let len = rng.random_range(3..12);
    std::iter::from_fn(|| {
-        let idx = rng.gen_range(0..CHARSET.len());
+        let idx = rng.random_range(0..CHARSET.len());
        Some(CHARSET[idx] as char)
    })
    .take(len)
--- a/stacker/Cargo.toml
+++ b/stacker/Cargo.toml
@@ -23,12 +23,12 @@ name = "hashmap"
 path = "example/hashmap.rs"

 [dev-dependencies]
-rand = "0.8.5"
+rand = "0.9"
 zipf = "7.0.0"
 rustc-hash = "2.1.0"
 proptest = "1.2.0"
 binggan = { version = "0.14.0" }
-rand_distr = "0.4.3"
+rand_distr = "0.5"

 [features]
 compare_hash_only = ["ahash"] # Compare hash only, not the key in the Hashmap
--- a/stacker/benches/bench.rs
+++ b/stacker/benches/bench.rs
@@ -90,10 +90,10 @@ fn bench_vint() {
            }
            // benchmark zipfs distribution numbers
            {
-                use rand::distributions::Distribution;
+                use rand::distr::Distribution;
                use rand::rngs::StdRng;
                let mut rng = StdRng::from_seed([3u8; 32]);
-                let zipf = zipf::ZipfDistribution::new(10_000, 1.03).unwrap();
+                let zipf = rand_distr::Zipf::new(10_000.0f64, 1.03).unwrap();
                let numbers: Vec<[u8; 8]> = (0..num_numbers)
                    .map(|_| zipf.sample(&mut rng).to_le_bytes())
                    .collect();
--- a/stacker/fuzz_test/Cargo.toml
+++ b/stacker/fuzz_test/Cargo.toml
@@ -7,8 +7,8 @@ edition = "2021"

 [dependencies]
 ahash = "0.8.7"
-rand = "0.8.5"
-rand_distr = "0.4.3"
+rand = "0.9"
+rand_distr = "0.5"
 tantivy-stacker = { version = "0.2.0", path = ".." }

 [workspace]
--- a/stacker/fuzz_test/src/main.rs
+++ b/stacker/fuzz_test/src/main.rs
@@ -14,7 +14,7 @@ fn test_with_seed(seed: u64) {
    let mut hash_map = AHashMap::new();
    let mut arena_hashmap = ArenaHashMap::default();
    let mut rng = StdRng::seed_from_u64(seed);
-    let key_count = rng.gen_range(1_000..=1_000_000);
+    let key_count = rng.random_range(1_000..=1_000_000);
    let exp = Exp::new(0.05).unwrap();

    for _ in 0..key_count {
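
The bytes branch added to `FastFieldRangeWeight::scorer()` in the diff above resolves the byte bounds to dictionary ordinals and then runs an ordinary integer range over the ordinal column. Below is a minimal, std-only sketch of that idea. It assumes a sorted, deduplicated in-memory dictionary rather than tantivy's real dictionary type, and `ordinal_range` and `count_matches` are hypothetical names used for illustration only, not tantivy APIs.

```rust
use std::ops::{Bound, Range};

/// Hypothetical helper: maps byte bounds to a half-open range of ordinals in
/// `dict`, which is assumed to be sorted lexicographically and deduplicated.
fn ordinal_range(dict: &[Vec<u8>], lower: Bound<&[u8]>, upper: Bound<&[u8]>) -> Range<usize> {
    let start = match lower {
        Bound::Unbounded => 0,
        // First term that is >= key.
        Bound::Included(key) => dict.partition_point(|term| term.as_slice() < key),
        // First term that is > key.
        Bound::Excluded(key) => dict.partition_point(|term| term.as_slice() <= key),
    };
    let end = match upper {
        Bound::Unbounded => dict.len(),
        // One past the last term that is <= key.
        Bound::Included(key) => dict.partition_point(|term| term.as_slice() <= key),
        // One past the last term that is < key.
        Bound::Excluded(key) => dict.partition_point(|term| term.as_slice() < key),
    };
    // Clamp so that inverted bounds yield an empty range.
    start..end.max(start)
}

/// Hypothetical helper: counts documents whose term ordinal falls in `range`.
/// `doc_ords[i]` is the dictionary ordinal of the value stored in document `i`.
fn count_matches(doc_ords: &[usize], range: &Range<usize>) -> usize {
    doc_ords.iter().filter(|&&ord| range.contains(&ord)).count()
}
```

With the six values from `test_bytes_field_ff_range_query` as the dictionary and one document per value, the inclusive range `[0x00, 0x20]` to `[0x01, 0x00]` maps to ordinals `1..4`, which is the three matching documents that test expects. The comparison happens once per bound against the dictionary, so the per-document work is a plain integer range check.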
Author	SHA1	Message	Date
cong.xie	698f073f88	fix fmt	2026-02-11 15:52:39 -05:00
cong.xie	cdd24b7ee5	Replace hyperloglogplus with Apache DataSketches HLL (lg_k=11) Switch tantivy's cardinality aggregation from the hyperloglogplus crate (HyperLogLog++ with p=16) to the official Apache DataSketches HLL implementation (datasketches crate v0.2.0 with lg_k=11, Hll4). This enables returning raw HLL sketch bytes from pomsky to Datadog's event query, where they can be properly deserialized and merged using the same DataSketches library (Java). The previous implementation required pomsky to fabricate fake HLL sketches from scalar cardinality estimates, which produced incorrect results when merged. Changes: - Cargo.toml: hyperloglogplus 0.4.1 -> datasketches 0.2.0 - CardinalityCollector: HyperLogLogPlus<u64, BuildSaltedHasher> -> HllSketch - Custom Serde impl using HllSketch binary format (cross-shard compat) - New to_sketch_bytes() for external consumers (pomsky) - Salt preserved via (salt, value) tuple hashing for column type disambiguation - Removed BuildSaltedHasher struct - Added 4 new unit tests (serde roundtrip, merge, binary compat, salt)	2026-02-11 08:49:46 -05:00
trinity-1686a	5562ce6037	Merge pull request #2818 from Darkheir/fix/query_grammar_regex_between_parentheses	2026-02-11 11:39:58 +01:00
Metin Dumandag	09b6ececa7	Export fields of the PercentileValuesVecEntry (#2833) Otherwise, there is no way to access these fields when not using the json serialized form of the aggregation results. This simple data struct is part of the public api, so its fields should be accessible as well.	2026-02-11 11:31:07 +01:00
Moe	8018016e46	feat: add fast field support for Bytes type (#100) (#2830) ## What Enable range queries and TopN sorting on `Bytes` fast fields, bringing them to parity with `Str` fields. ## Why `BytesColumn` uses the same dictionary encoding as `StrColumn` internally, but range queries and TopN sorting were explicitly disabled for `Bytes`. This prevented use cases like storing lexicographically sortable binary data (e.g., arbitrary-precision decimals) that need efficient range filtering. ## How 1. Enable range queries for Bytes - Changed `is_type_valid_for_fastfield_range_query()` to return `true` for `Type::Bytes` 2. Add BytesColumn handling in scorer - Added a branch in `FastFieldRangeWeight::scorer()` to handle bytes fields using dictionary ordinal lookup (mirrors the existing `StrColumn` logic) 3. Add SortByBytes - New sort key computer for TopN queries on bytes columns ## Tests - `test_bytes_field_ff_range_query` - Tests inclusive/exclusive bounds and unbounded ranges - `test_sort_by_bytes_asc` / `test_sort_by_bytes_desc` - Tests lexicographic ordering in both directions	2026-02-11 11:26:18 +01:00
trinity-1686a	6bf185dc3f	Merge pull request #2829 from quickwit-oss/cong.xie/add-intermediate-accessors	2026-02-10 17:07:24 +01:00
cong.xie	bb141abe22	feat(aggregation): add keys() accessor to IntermediateAggregationResults	2026-02-09 15:38:35 -05:00
cong.xie	f1c29ba972	resolve conflict	2026-02-06 14:23:11 -05:00
cong.xie	ae0554a6a5	feat(aggregation): add public accessors for intermediate aggregation results Add accessor methods to allow external crates to read intermediate aggregation results without accessing pub(crate) fields: - IntermediateAggregationResults: get(), remove() - IntermediateTermBucketResult: entries(), sum_other_doc_count(), doc_count_error_upper_bound() - IntermediateAverage: stats() - IntermediateStats: count(), sum() - IntermediateKey: Display impl for string conversion	2026-02-06 11:12:20 -05:00
cong.xie	0d7abe5d23	feat(aggregation): add public accessors for intermediate aggregation results Add accessor methods to allow external crates to read intermediate aggregation results without accessing pub(crate) fields: - IntermediateAggregationResults: get(), get_mut(), remove() - IntermediateTermBucketResult: entries(), sum_other_doc_count(), doc_count_error_upper_bound() - IntermediateAverage: stats() - IntermediateStats: count(), sum() - IntermediateKey: Display impl for string conversion	2026-02-06 10:28:59 -05:00
PSeitz	28db952131	Add regex search and merge segments benchmark (#2826) * add merge_segments benchmark * add regex search bench	2026-02-02 17:28:02 +01:00
PSeitz	98ebbf922d	faster exclude queries (#2825) * faster exclude queries Faster exclude queries with multiple terms. Changes `Exclude` to be able to exclude multiple DocSets, instead of putting the docsets into a union. Use `seek_danger` in `Exclude`. closes #2822 * replace unwrap with match	2026-01-30 17:06:41 +01:00
Paul Masurel	4a89e74597	Fix rfc3339 typos and add Claude Code skills (#2823) Closes #2817	2026-01-30 12:00:28 +01:00
Alex Lazar	4d99e51e50	Bump oneshot to 0.1.13 per dependabot (#2821)	2026-01-30 11:42:01 +01:00
Darkheir	a55e4069e4	feat(query-grammar): Apply PR review suggestions Signed-off-by: Darkheir <raphael.cohen@sekoia.io>	2026-01-28 14:13:55 +01:00
Darkheir	1fd30c62be	fix(query-grammar): Fix regexes between parentheses Signed-off-by: Darkheir <raphael.cohen@sekoia.io>	2026-01-28 10:37:51 +01:00
trinity-1686a	9b619998bd	Merge pull request #2816 from evance-br/fix-closing-paren-elastic-range	2026-01-27 17:00:08 +01:00
Evance Soumaoro	765c448945	uncomment commented code when testing	2026-01-27 13:19:41 +00:00
Evance Soumaoro	943594ebaa	uncomment commented code when testing	2026-01-27 13:08:38 +00:00
Evance Soumaoro	df17daae0d	fix closing parenthesis error on elastic range queries for lenient parser	2026-01-27 13:01:14 +00:00
Paul Masurel	0ae94baef5	Remove temp file (#2815) Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>	2026-01-27 09:22:11 +01:00
Paul Masurel	3f448ecf79	Bugfix on intersection. (#2812) The intersection algorithm made it possible for .seek(..) to be called with values lower than the current doc id, breaking the DocSet contract. The fix removes the optimization that caused left.seek(..) to be replaced by a simpler left.advance(..). Simply doing so led to a performance regression. I therefore integrated that idea within SegmentPostings.seek. We now attempt to check the next doc systematically on seek, PROVIDED the block is already loaded. Closes #2811 Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>	2026-01-27 09:21:09 +01:00
Paul Masurel	b86caeefe2	Major bugfix in intersection A bug was added with the `seek_into_the_danger_zone()` optimization (spotted and fixed by Stu). The contract says seek_into_the_danger_zone returns true if doc is part of the docset. The blanket implementation goes like this. ``` let current_doc = self.doc(); if current_doc < target { self.seek(target); } self.doc() == target ``` So it will return true if target is TERMINATED, where really TERMINATED does not belong to the docset. The fix tries to clarify the contracts and fixes the intersection algorithm. We observe a small but all over the board improvement in intersection performance. --------- Co-authored-by: Stu Hood <stuhood@gmail.com> Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>	2026-01-23 18:44:10 +01:00
ChangRui-Ryan	abf1e64f4d	add benchmark for string search and get (#2795)	2026-01-19 11:50:41 +01:00
trinity-1686a	12977bc7c4	upgrade some dependencies (#2802) including rand, which had a few breaking changes	2026-01-14 10:19:09 +01:00
trinity-1686a	0c94eb94c3	Merge pull request #2799 from jollygreenlaser/lru	2026-01-13 22:47:35 +01:00
Paul Masurel	c92e831dde	Minor refactoring in PostingsSerializer (#2801) Removes the Write generics argument in PostingsSerializer. This removes useless generic. Prepares the path for codecs. Removes one useless CountingWrite layer. etc. Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>	2026-01-12 13:53:43 +01:00
Alex Lazar	947c0d5f40	Bump lru to 0.16.3 per dependabot	2026-01-09 23:25:51 -08:00
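
The "Major bugfix in intersection" commit listed above describes how the boolean blanket implementation reports a hit when the target is `TERMINATED`. The following std-only sketch reproduces that behaviour next to a `seek_danger`-style method that returns an enum, as the diff does. `SortedDocs` is a hypothetical stand-in for a real `DocSet`, the value chosen for `TERMINATED` is an assumption, and the `seek_danger` body is an illustration of the contract rather than tantivy's actual default implementation.

```rust
type DocId = u32;

/// Sentinel meaning "no more documents". The exact value is an assumption here.
const TERMINATED: DocId = i32::MAX as u32;

/// Mirrors the two variants used in the diff.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SeekDangerResult {
    /// The target is in the docset, and the docset is positioned on it.
    Found,
    /// The target is absent. No document below the given value can match.
    SeekLowerBound(DocId),
}

/// Hypothetical docset over a strictly sorted list of doc ids.
struct SortedDocs {
    docs: Vec<DocId>,
    cursor: usize,
}

impl SortedDocs {
    fn new(docs: Vec<DocId>) -> Self {
        assert!(docs.windows(2).all(|w| w[0] < w[1]));
        SortedDocs { docs, cursor: 0 }
    }

    fn doc(&self) -> DocId {
        self.docs.get(self.cursor).copied().unwrap_or(TERMINATED)
    }

    fn seek(&mut self, target: DocId) -> DocId {
        while self.cursor < self.docs.len() && self.docs[self.cursor] < target {
            self.cursor += 1;
        }
        self.doc()
    }

    /// The boolean blanket implementation quoted in the commit message.
    fn seek_into_the_danger_zone_old(&mut self, target: DocId) -> bool {
        let current_doc = self.doc();
        if current_doc < target {
            self.seek(target);
        }
        // Bug: this is true for target == TERMINATED once the docset is exhausted.
        self.doc() == target
    }

    /// Sketch of the enum-based contract: TERMINATED is never reported as found.
    fn seek_danger(&mut self, target: DocId) -> SeekDangerResult {
        if target >= TERMINATED {
            return SeekDangerResult::SeekLowerBound(TERMINATED);
        }
        let doc = self.seek(target);
        if doc == target {
            SeekDangerResult::Found
        } else {
            SeekDangerResult::SeekLowerBound(doc)
        }
    }
}
```

Returning a lower bound instead of `false` also gives the caller something to act on: an intersection or exclusion can skip straight to that doc id instead of probing every candidate in between.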