feat(aggregation): add keys() accessor to IntermediateAggregationResults

resolve conflict
feat(aggregation): add public accessors for intermediate aggregation results
2026-02-10 10:00:37 +00:00 · 2026-02-09 15:38:35 -05:00 · 2026-02-06 14:23:11 -05:00 · 2026-02-06 11:12:20 -05:00 · 2026-02-06 10:28:59 -05:00 · 2026-01-30 17:06:41 +01:00
22 changed files with 397 additions and 117 deletions
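
Usage sketch for the new accessors. This is not tantivy code: the two structs below are reduced stand-ins that copy only the accessor signatures from this change (`get`, `remove`, `keys`, `count`, `sum`). The `insert` and `new` helpers, the `averages` function, and the plain `HashMap` storage are assumptions made so the snippet compiles on its own.

```rust
use std::collections::HashMap;

/// Stand-in for `IntermediateStats`, keeping only the fields read by the new accessors.
#[derive(Debug, Clone, PartialEq)]
pub struct IntermediateStats {
    count: u64,
    sum: f64,
}

impl IntermediateStats {
    /// Hypothetical constructor, present only so the sketch can build values.
    pub fn new(count: u64, sum: f64) -> Self {
        Self { count, sum }
    }
    /// Returns the number of values collected.
    pub fn count(&self) -> u64 {
        self.count
    }
    /// Returns the sum of all values collected.
    pub fn sum(&self) -> f64 {
        self.sum
    }
}

/// Stand-in for `IntermediateAggregationResults`, reduced to a name -> stats map.
#[derive(Debug, Default)]
pub struct IntermediateAggregationResults {
    aggs_res: HashMap<String, IntermediateStats>,
}

impl IntermediateAggregationResults {
    /// Hypothetical helper; the real type exposes a fallible `push` instead.
    pub fn insert(&mut self, key: String, value: IntermediateStats) {
        self.aggs_res.insert(key, value);
    }
    pub fn get(&self, key: &str) -> Option<&IntermediateStats> {
        self.aggs_res.get(key)
    }
    pub fn remove(&mut self, key: &str) -> Option<IntermediateStats> {
        self.aggs_res.remove(key)
    }
    pub fn keys(&self) -> impl Iterator<Item = &String> {
        self.aggs_res.keys()
    }
}

/// Walks every aggregation by name and derives an average from `sum` and `count`.
/// Returns `None` for an aggregation that collected no values.
pub fn averages(results: &IntermediateAggregationResults) -> Vec<(String, Option<f64>)> {
    let mut out: Vec<(String, Option<f64>)> = results
        .keys()
        .filter_map(|name| {
            let stats = results.get(name)?;
            let avg = (stats.count() > 0).then(|| stats.sum() / stats.count() as f64);
            Some((name.clone(), avg))
        })
        .collect();
    // Map iteration order is unspecified, so sort by name for a stable result.
    out.sort_by(|a, b| a.0.cmp(&b.0));
    out
}
```

A caller that wants to take ownership of one result, for example to merge it elsewhere, would use `remove` rather than `get`.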
--- a/.claude/skills/rationalize-deps/SKILL.md
+++ b/.claude/skills/rationalize-deps/SKILL.md
@@ -0,0 +1,125 @@
+---
+name: rationalize-deps
+description: Analyze Cargo.toml dependencies and attempt to remove unused features to reduce compile times and binary size
+---
+
+# Rationalize Dependencies
+
+This skill analyzes Cargo.toml dependencies to identify and remove unused features.
+
+## Overview
+
+Many crates enable features by default that may not be needed. This skill:
+1. Identifies dependencies with default features enabled
+2. Tests if `default-features = false` works
+3. Identifies which specific features are actually needed
+4. Verifies compilation after changes
+
+## Step 1: Identify the target
+
+Ask the user which crate(s) to analyze:
+- A specific crate name (e.g., "tokio", "serde")
+- A specific workspace member (e.g., "quickwit-search")
+- "all" to scan the entire workspace
+
+## Step 2: Analyze current dependencies
+
+For the workspace Cargo.toml (`quickwit/Cargo.toml`), list dependencies that:
+- Do NOT have `default-features = false`
+- Have default features that might be unnecessary
+
+Run: `cargo tree -p <crate> -f "{p} {f}" --edges features` to see what features are actually used.
+
+## Step 3: For each candidate dependency
+
+### 3a: Check the crate's default features
+
+Look up the crate on crates.io or check its Cargo.toml to understand:
+- What features are enabled by default
+- What each feature provides
+
+Use: `cargo metadata --format-version=1 | jq '.packages[] | select(.name == "<crate>") | .features'`
+
+### 3b: Try disabling default features
+
+Modify the dependency in `quickwit/Cargo.toml`:
+
+From:
+```toml
+some-crate = { version = "1.0" }
+```
+
+To:
+```toml
+some-crate = { version = "1.0", default-features = false }
+```
+
+### 3c: Run cargo check
+
+Run: `cargo check --workspace` (or target specific packages for faster feedback)
+
+If compilation fails:
+1. Read the error messages to identify which features are needed
+2. Add only the required features explicitly:
+   ```toml
+   some-crate = { version = "1.0", default-features = false, features = ["needed-feature"] }
+   ```
+3. Re-run cargo check
+
+### 3d: Binary search for minimal features
+
+If there are many default features, use binary search:
+1. Start with no features
+2. If it fails, add half the default features
+3. Continue until you find the minimal set
+
+## Step 4: Document findings
+
+For each dependency analyzed, report:
+- Original configuration
+- New configuration (if changed)
+- Features that were removed
+- Any features that are required
+
+## Step 5: Verify full build
+
+After all changes, run:
+```bash
+cargo check --workspace --all-targets
+cargo test --workspace --no-run
+```
+
+## Common Patterns
+
+### Serde
+Often only needs `derive` (plus `std` once default features are off):
+```toml
+serde = { version = "1.0", default-features = false, features = ["derive", "std"] }
+```
+
+### Tokio
+Identify which runtime features are actually used:
+```toml
+tokio = { version = "1.0", default-features = false, features = ["rt-multi-thread", "macros", "sync"] }
+```
+
+### Reqwest
+Often doesn't need all TLS backends:
+```toml
+reqwest = { version = "0.11", default-features = false, features = ["rustls-tls", "json"] }
+```
+
+## Rollback
+
+If changes cause issues:
+```bash
+git checkout quickwit/Cargo.toml
+cargo check --workspace
+```
+
+## Tips
+
+- Start with large crates that have many default features (tokio, reqwest, hyper)
+- Use `cargo bloat --crates` to identify large dependencies
+- Check `cargo tree -d` for duplicate dependencies that might indicate feature conflicts
+- Some features are needed only for tests - consider using `[dev-dependencies]` features
--- a/.claude/skills/simple-pr/SKILL.md
+++ b/.claude/skills/simple-pr/SKILL.md
@@ -0,0 +1,60 @@
+---
+name: simple-pr
+description: Create a simple PR from staged changes with an auto-generated commit message
+disable-model-invocation: true
+---
+
+# Simple PR
+
+Follow these steps to create a simple PR from staged changes:
+
+## Step 1: Check workspace state
+
+Run: `git status`
+
+Verify that all changes have been staged (no unstaged changes). If there are unstaged changes, abort and ask the user to stage their changes first with `git add`.
+
+Also verify that we are on the `main` branch. If not, abort and ask the user to switch to main first.
+
+## Step 2: Ensure main is up to date
+
+Run: `git pull origin main`
+
+This ensures we're working from the latest code.
+
+## Step 3: Review staged changes
+
+Run: `git diff --cached`
+
+Review the staged changes to understand what the PR will contain.
+
+## Step 4: Generate commit message
+
+Based on the staged changes, generate a concise commit message (1-2 sentences) that describes the "why" rather than the "what".
+
+Display the proposed commit message to the user and ask for confirmation before proceeding.
+
+## Step 5: Create a new branch
+
+Get the git username: `git config user.name | tr ' ' '-' | tr '[:upper:]' '[:lower:]'`
+
+Create a short, descriptive branch name based on the changes (e.g., `fix-typo-in-readme`, `add-retry-logic`, `update-deps`).
+
+Create and checkout the branch: `git checkout -b {username}/{short-descriptive-name}`
+
+## Step 6: Commit changes
+
+Commit with the message from step 4:
+```
+git commit -m "{commit-message}"
+```
+
+## Step 7: Push and open a PR
+
+Push the branch and open a PR:
+```
+git push -u origin {branch-name}
+gh pr create --title "{commit-message-title}" --body "{longer-description-if-needed}"
+```
+
+Report the PR URL to the user when complete.
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -15,7 +15,7 @@ rust-version = "1.85"
 exclude = ["benches/*.json", "benches/*.txt"]

 [dependencies]
-oneshot = "0.1.7"
+oneshot = "0.1.13"
 base64 = "0.22.0"
 byteorder = "1.4.3"
 crc32fast = "1.3.2"
--- a/doc/src/json.md
+++ b/doc/src/json.md
@@ -60,7 +60,7 @@ At indexing, tantivy will try to interpret number and strings as different type
 priority order.

 Numbers will be interpreted as u64, i64 and f64 in that order.
-Strings will be interpreted as rfc3999 dates or simple strings.
+Strings will be interpreted as rfc3339 dates or simple strings.

 The first working type is picked and is the only term that is emitted for indexing.
 Note this interpretation happens on a per-document basis, and there is no effort to try to sniff
@@ -81,7 +81,7 @@ Will be interpreted as
 (my_path.my_segment, String, 233) or (my_path.my_segment, u64, 233)
 ```

-Likewise, we need to emit two tokens if the query contains an rfc3999 date.
+Likewise, we need to emit two tokens if the query contains an rfc3339 date.
 Indeed the date could have been actually a single token inside the text of a document at ingestion time. Generally speaking, we will always at least emit a string token in query parsing, and sometimes more.

 If one more json field is defined, things get even more complicated.
--- a/query-grammar/src/query_grammar.rs
+++ b/query-grammar/src/query_grammar.rs
@@ -560,7 +560,7 @@ fn range_infallible(inp: &str) -> JResult<&str, UserInputLeaf> {
            (
                (
                    value((), tag(">=")),
-                    map(word_infallible("", false), |(bound, err)| {
+                    map(word_infallible(")", false), |(bound, err)| {
                        (
                            (
                                bound
@@ -574,7 +574,7 @@ fn range_infallible(inp: &str) -> JResult<&str, UserInputLeaf> {
                ),
                (
                    value((), tag("<=")),
-                    map(word_infallible("", false), |(bound, err)| {
+                    map(word_infallible(")", false), |(bound, err)| {
                        (
                            (
                                UserInputBound::Unbounded,
@@ -588,7 +588,7 @@ fn range_infallible(inp: &str) -> JResult<&str, UserInputLeaf> {
                ),
                (
                    value((), tag(">")),
-                    map(word_infallible("", false), |(bound, err)| {
+                    map(word_infallible(")", false), |(bound, err)| {
                        (
                            (
                                bound
@@ -602,7 +602,7 @@ fn range_infallible(inp: &str) -> JResult<&str, UserInputLeaf> {
                ),
                (
                    value((), tag("<")),
-                    map(word_infallible("", false), |(bound, err)| {
+                    map(word_infallible(")", false), |(bound, err)| {
                        (
                            (
                                UserInputBound::Unbounded,
@@ -1323,6 +1323,14 @@ mod test {
        test_parse_query_to_ast_helper("<a", "{\"*\" TO \"a\"}");
        test_parse_query_to_ast_helper("<=a", "{\"*\" TO \"a\"]");
        test_parse_query_to_ast_helper("<=bsd", "{\"*\" TO \"bsd\"]");
+
+        test_parse_query_to_ast_helper("(<=42)", "{\"*\" TO \"42\"]");
+        test_parse_query_to_ast_helper("(<=42 )", "{\"*\" TO \"42\"]");
+        test_parse_query_to_ast_helper("(age:>5)", "\"age\":{\"5\" TO \"*\"}");
+        test_parse_query_to_ast_helper(
+            "(title:bar AND age:>12)",
+            "(+\"title\":bar +\"age\":{\"12\" TO \"*\"})",
+        );
    }

    #[test]
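
Note on the `word_infallible(")", false)` change above: the sketch below illustrates why the delimiter matters for inputs such as `(<=42)`. It is not the nom-based parser. `read_bound` is a hypothetical helper, and it assumes that the first argument of `word_infallible` lists extra characters that end a word, and that whitespace always ends a word.

```rust
/// Splits `input` into (bound, rest). The bound ends at the first whitespace
/// character or at the first character listed in `delimiters`.
pub fn read_bound<'a>(input: &'a str, delimiters: &str) -> (&'a str, &'a str) {
    let end = input
        .find(|c: char| c.is_whitespace() || delimiters.contains(c))
        .unwrap_or(input.len());
    input.split_at(end)
}
```

With an empty delimiter list, `read_bound("42)", "")` swallows the closing parenthesis into the bound. With `")"` as the delimiter, the bound is `42` and the parenthesis is left for the enclosing group to consume.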
--- a/src/aggregation/intermediate_agg_result.rs
+++ b/src/aggregation/intermediate_agg_result.rs
@@ -90,6 +90,19 @@ impl From<IntermediateKey> for Key {

 impl Eq for IntermediateKey {}

+impl std::fmt::Display for IntermediateKey {
+    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+        match self {
+            IntermediateKey::Str(val) => f.write_str(val),
+            IntermediateKey::F64(val) => f.write_str(&val.to_string()),
+            IntermediateKey::U64(val) => f.write_str(&val.to_string()),
+            IntermediateKey::I64(val) => f.write_str(&val.to_string()),
+            IntermediateKey::Bool(val) => f.write_str(&val.to_string()),
+            IntermediateKey::IpAddr(val) => f.write_str(&val.to_string()),
+        }
+    }
+}
+
 impl std::hash::Hash for IntermediateKey {
    fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
        core::mem::discriminant(self).hash(state);
@@ -105,6 +118,21 @@ impl std::hash::Hash for IntermediateKey {
 }

 impl IntermediateAggregationResults {
+    /// Returns a reference to the intermediate aggregation result for the given key.
+    pub fn get(&self, key: &str) -> Option<&IntermediateAggregationResult> {
+        self.aggs_res.get(key)
+    }
+
+    /// Removes and returns the intermediate aggregation result for the given key.
+    pub fn remove(&mut self, key: &str) -> Option<IntermediateAggregationResult> {
+        self.aggs_res.remove(key)
+    }
+
+    /// Returns an iterator over the keys in the intermediate aggregation results.
+    pub fn keys(&self) -> impl Iterator<Item = &String> {
+        self.aggs_res.keys()
+    }
+
    /// Add a result
    pub fn push(&mut self, key: String, value: IntermediateAggregationResult) -> crate::Result<()> {
        let entry = self.aggs_res.entry(key);
@@ -639,6 +667,21 @@ pub struct IntermediateTermBucketResult {
 }

 impl IntermediateTermBucketResult {
+    /// Returns a reference to the map of bucket entries keyed by [`IntermediateKey`].
+    pub fn entries(&self) -> &FxHashMap<IntermediateKey, IntermediateTermBucketEntry> {
+        &self.entries
+    }
+
+    /// Returns the count of documents not included in the returned buckets.
+    pub fn sum_other_doc_count(&self) -> u64 {
+        self.sum_other_doc_count
+    }
+
+    /// Returns the upper bound of the error on document counts in the returned buckets.
+    pub fn doc_count_error_upper_bound(&self) -> u64 {
+        self.doc_count_error_upper_bound
+    }
+
    pub(crate) fn into_final_result(
        self,
        req: &TermsAggregation,
@@ -820,7 +863,7 @@ impl IntermediateRangeBucketEntry {
        };

        // If we have a date type on the histogram buckets, we add the `key_as_string` field as
-        // rfc339
+        // rfc3339
        if column_type == Some(ColumnType::DateTime) {
            if let Some(val) = range_bucket_entry.to {
                let key_as_string = format_date(val as i64)?;
--- a/src/aggregation/metric/average.rs
+++ b/src/aggregation/metric/average.rs
@@ -55,6 +55,12 @@ impl IntermediateAverage {
    pub(crate) fn from_stats(stats: IntermediateStats) -> Self {
        Self { stats }
    }
+
+    /// Returns a reference to the underlying [`IntermediateStats`].
+    pub fn stats(&self) -> &IntermediateStats {
+        &self.stats
+    }
+
    /// Merges the other intermediate result into self.
    pub fn merge_fruits(&mut self, other: IntermediateAverage) {
        self.stats.merge_fruits(other.stats);
--- a/src/aggregation/metric/stats.rs
+++ b/src/aggregation/metric/stats.rs
@@ -110,6 +110,16 @@ impl Default for IntermediateStats {
 }

 impl IntermediateStats {
+    /// Returns the number of values collected.
+    pub fn count(&self) -> u64 {
+        self.count
+    }
+
+    /// Returns the sum of all values collected.
+    pub fn sum(&self) -> f64 {
+        self.sum
+    }
+
    /// Merges the other stats intermediate result into self.
    pub fn merge_fruits(&mut self, other: IntermediateStats) {
        self.count += other.count;
--- a/src/directory/mmap_directory/mod.rs
+++ b/src/directory/mmap_directory/mod.rs
@@ -676,7 +676,7 @@ mod tests {
            let num_segments = reader.searcher().segment_readers().len();
            assert!(num_segments <= 4);
            let num_components_except_deletes_and_tempstore =
-                crate::index::SegmentComponent::iterator().len() - 2;
+                crate::index::SegmentComponent::iterator().len() - 1;
            let max_num_mmapped = num_components_except_deletes_and_tempstore * num_segments;
            assert_eventually(|| {
                let num_mmapped = mmap_directory.get_cache_info().mmapped.len();
--- a/src/docset.rs
+++ b/src/docset.rs
@@ -65,8 +65,8 @@ pub trait DocSet: Send {
    ///   `seek_danger(..)` until it returns `Found`, and get back to a valid state.
    ///
    /// `seek_lower_bound` can be any `DocId` (in the docset or not) as long as it is in
-    /// `(target .. seek_result]` where `seek_result` is the first document in the docset greater
-    /// than to `target`.
+    /// `(target .. seek_result] U {TERMINATED}` where `seek_result` is the first document in the
+    /// docset greater than `target`.
    ///
    /// `seek_danger` may return `SeekLowerBound(TERMINATED)`.
    ///
@@ -98,7 +98,7 @@ pub trait DocSet: Send {
        if doc == target {
            SeekDangerResult::Found
        } else {
-            SeekDangerResult::SeekLowerBound(self.doc())
+            SeekDangerResult::SeekLowerBound(doc)
        }
    }

--- a/src/index/index_meta.rs
+++ b/src/index/index_meta.rs
@@ -1,8 +1,6 @@
 use std::collections::HashSet;
 use std::fmt;
 use std::path::PathBuf;
-use std::sync::atomic::AtomicBool;
-use std::sync::Arc;

 use serde::{Deserialize, Serialize};

@@ -37,7 +35,6 @@ impl SegmentMetaInventory {
        let inner = InnerSegmentMeta {
            segment_id,
            max_doc,
-            include_temp_doc_store: Arc::new(AtomicBool::new(true)),
            deletes: None,
        };
        SegmentMeta::from(self.inventory.track(inner))
@@ -85,15 +82,6 @@ impl SegmentMeta {
        self.tracked.segment_id
    }

-    /// Removes the Component::TempStore from the alive list and
-    /// therefore marks the temp docstore file to be deleted by
-    /// the garbage collection.
-    pub fn untrack_temp_docstore(&self) {
-        self.tracked
-            .include_temp_doc_store
-            .store(false, std::sync::atomic::Ordering::Relaxed);
-    }
-
    /// Returns the number of deleted documents.
    pub fn num_deleted_docs(&self) -> u32 {
        self.tracked
@@ -111,20 +99,9 @@ impl SegmentMeta {
    /// is by removing all files that have been created by tantivy
    /// and are not used by any segment anymore.
    pub fn list_files(&self) -> HashSet<PathBuf> {
-        if self
-            .tracked
-            .include_temp_doc_store
-            .load(std::sync::atomic::Ordering::Relaxed)
-        {
-            SegmentComponent::iterator()
-                .map(|component| self.relative_path(*component))
-                .collect::<HashSet<PathBuf>>()
-        } else {
-            SegmentComponent::iterator()
-                .filter(|comp| *comp != &SegmentComponent::TempStore)
-                .map(|component| self.relative_path(*component))
-                .collect::<HashSet<PathBuf>>()
-        }
+        SegmentComponent::iterator()
+            .map(|component| self.relative_path(*component))
+            .collect::<HashSet<PathBuf>>()
    }

    /// Returns the relative path of a component of our segment.
@@ -138,7 +115,6 @@ impl SegmentMeta {
            SegmentComponent::Positions => ".pos".to_string(),
            SegmentComponent::Terms => ".term".to_string(),
            SegmentComponent::Store => ".store".to_string(),
-            SegmentComponent::TempStore => ".store.temp".to_string(),
            SegmentComponent::FastFields => ".fast".to_string(),
            SegmentComponent::FieldNorms => ".fieldnorm".to_string(),
            SegmentComponent::Delete => format!(".{}.del", self.delete_opstamp().unwrap_or(0)),
@@ -183,7 +159,6 @@ impl SegmentMeta {
            segment_id: inner_meta.segment_id,
            max_doc,
            deletes: None,
-            include_temp_doc_store: Arc::new(AtomicBool::new(true)),
        });
        SegmentMeta { tracked }
    }
@@ -202,7 +177,6 @@ impl SegmentMeta {
        let tracked = self.tracked.map(move |inner_meta| InnerSegmentMeta {
            segment_id: inner_meta.segment_id,
            max_doc: inner_meta.max_doc,
-            include_temp_doc_store: Arc::new(AtomicBool::new(true)),
            deletes: Some(delete_meta),
        });
        SegmentMeta { tracked }
@@ -214,14 +188,6 @@ struct InnerSegmentMeta {
    segment_id: SegmentId,
    max_doc: u32,
    pub deletes: Option<DeleteMeta>,
-    /// If you want to avoid the SegmentComponent::TempStore file to be covered by
-    /// garbage collection and deleted, set this to true. This is used during merge.
-    #[serde(skip)]
-    #[serde(default = "default_temp_store")]
-    pub(crate) include_temp_doc_store: Arc<AtomicBool>,
-}
-fn default_temp_store() -> Arc<AtomicBool> {
-    Arc::new(AtomicBool::new(false))
 }

 impl InnerSegmentMeta {
--- a/src/index/segment_component.rs
+++ b/src/index/segment_component.rs
@@ -23,8 +23,6 @@ pub enum SegmentComponent {
    /// Accessing a document from the store is relatively slow, as it
    /// requires to decompress the entire block it belongs to.
    Store,
-    /// Temporary storage of the documents, before streamed to `Store`.
-    TempStore,
    /// Bitset describing which document of the segment is alive.
    /// (It was representing deleted docs but changed to represent alive docs from v0.17)
    Delete,
@@ -33,14 +31,13 @@ pub enum SegmentComponent {
 impl SegmentComponent {
    /// Iterates through the components.
    pub fn iterator() -> slice::Iter<'static, SegmentComponent> {
-        static SEGMENT_COMPONENTS: [SegmentComponent; 8] = [
+        static SEGMENT_COMPONENTS: [SegmentComponent; 7] = [
            SegmentComponent::Postings,
            SegmentComponent::Positions,
            SegmentComponent::FastFields,
            SegmentComponent::FieldNorms,
            SegmentComponent::Terms,
            SegmentComponent::Store,
-            SegmentComponent::TempStore,
            SegmentComponent::Delete,
        ];
        SEGMENT_COMPONENTS.iter()
--- a/src/indexer/index_writer.rs
+++ b/src/indexer/index_writer.rs
@@ -218,7 +218,7 @@ fn index_documents<D: Document>(
    let alive_bitset_opt = apply_deletes(&segment_with_max_doc, &mut delete_cursor, &doc_opstamps)?;

    let meta = segment_with_max_doc.meta().clone();
-    meta.untrack_temp_docstore();
+
    // update segment_updater inventory to remove tempstore
    let segment_entry = SegmentEntry::new(meta, delete_cursor, alive_bitset_opt);
    segment_updater.schedule_add_segment(segment_entry).wait()?;
--- a/src/postings/block_segment_postings.rs
+++ b/src/postings/block_segment_postings.rs
@@ -303,10 +303,10 @@ impl BlockSegmentPostings {
    }

    pub(crate) fn load_block(&mut self) {
-        let offset = self.skip_reader.byte_offset();
        if self.block_is_loaded() {
            return;
        }
+        let offset = self.skip_reader.byte_offset();
        match self.skip_reader.block_info() {
            BlockInfo::BitPacked {
                doc_num_bits,
--- a/src/postings/segment_postings.rs
+++ b/src/postings/segment_postings.rs
@@ -168,12 +168,20 @@ impl DocSet for SegmentPostings {
        self.doc()
    }

+    #[inline]
    fn seek(&mut self, target: DocId) -> DocId {
        debug_assert!(self.doc() <= target);
        if self.doc() >= target {
            return self.doc();
        }

+        // As an optimization, if the block is already loaded, we can
+        // cheaply check the next doc.
+        self.cur = (self.cur + 1).min(COMPRESSION_BLOCK_SIZE - 1);
+        if self.doc() >= target {
+            return self.doc();
+        }
+
        // Delegate block-local search to BlockSegmentPostings::seek, which returns
        // the in-block index of the first doc >= target.
        self.cur = self.block_cursor.seek(target);
--- a/src/query/boolean_query/boolean_weight.rs
+++ b/src/query/boolean_query/boolean_weight.rs
@@ -291,18 +291,6 @@ impl<TScoreCombiner: ScoreCombiner> BooleanWeight<TScoreCombiner> {
            }
        };

-        let exclude_scorer_opt: Option<Box<dyn Scorer>> = if exclude_scorers.is_empty() {
-            None
-        } else {
-            let exclude_specialized_scorer: SpecializedScorer =
-                scorer_union(exclude_scorers, DoNothingCombiner::default, num_docs);
-            Some(into_box_scorer(
-                exclude_specialized_scorer,
-                DoNothingCombiner::default,
-                num_docs,
-            ))
-        };
-
        let include_scorer = match (should_scorers, must_scorers) {
            (ShouldScorersCombinationMethod::Ignored, must_scorers) => {
                // No SHOULD clauses (or they were absorbed into MUST).
@@ -380,16 +368,23 @@ impl<TScoreCombiner: ScoreCombiner> BooleanWeight<TScoreCombiner> {
                }
            }
        };
-        if let Some(exclude_scorer) = exclude_scorer_opt {
-            let include_scorer_boxed =
-                into_box_scorer(include_scorer, &score_combiner_fn, num_docs);
-            Ok(SpecializedScorer::Other(Box::new(Exclude::new(
-                include_scorer_boxed,
-                exclude_scorer,
-            ))))
-        } else {
-            Ok(include_scorer)
+        if exclude_scorers.is_empty() {
+            return Ok(include_scorer);
        }
+
+        let include_scorer_boxed = into_box_scorer(include_scorer, &score_combiner_fn, num_docs);
+        let scorer: Box<dyn Scorer> = if exclude_scorers.len() == 1 {
+            let exclude_scorer = exclude_scorers.pop().unwrap();
+            match exclude_scorer.downcast::<TermScorer>() {
+                // Cast to TermScorer succeeded
+                Ok(exclude_scorer) => Box::new(Exclude::new(include_scorer_boxed, *exclude_scorer)),
+                // We get back the original Box<dyn Scorer>
+                Err(exclude_scorer) => Box::new(Exclude::new(include_scorer_boxed, exclude_scorer)),
+            }
+        } else {
+            Box::new(Exclude::new(include_scorer_boxed, exclude_scorers))
+        };
+        Ok(SpecializedScorer::Other(scorer))
    }
 }

--- a/src/query/exclude.rs
+++ b/src/query/exclude.rs
@@ -1,48 +1,71 @@
-use crate::docset::{DocSet, TERMINATED};
+use crate::docset::{DocSet, SeekDangerResult, TERMINATED};
 use crate::query::Scorer;
 use crate::{DocId, Score};

-#[inline]
-fn is_within<TDocSetExclude: DocSet>(docset: &mut TDocSetExclude, doc: DocId) -> bool {
-    docset.doc() <= doc && docset.seek(doc) == doc
-}
-
-/// Filters a given `DocSet` by removing the docs from a given `DocSet`.
+/// An exclusion set is a set of documents
+/// that should be excluded from a given DocSet.
 ///
-/// The excluding docset has no impact on scoring.
-pub struct Exclude<TDocSet, TDocSetExclude> {
-    underlying_docset: TDocSet,
-    excluding_docset: TDocSetExclude,
+/// It can be a single DocSet, or a Vec of DocSets.
+pub trait ExclusionSet: Send {
+    /// Returns `true` if the given `doc` is in the exclusion set.
+    fn contains(&mut self, doc: DocId) -> bool;
 }

-impl<TDocSet, TDocSetExclude> Exclude<TDocSet, TDocSetExclude>
+impl<TDocSet: DocSet> ExclusionSet for TDocSet {
+    #[inline]
+    fn contains(&mut self, doc: DocId) -> bool {
+        self.seek_danger(doc) == SeekDangerResult::Found
+    }
+}
+
+impl<TDocSet: DocSet> ExclusionSet for Vec<TDocSet> {
+    #[inline]
+    fn contains(&mut self, doc: DocId) -> bool {
+        for docset in self.iter_mut() {
+            if docset.seek_danger(doc) == SeekDangerResult::Found {
+                return true;
+            }
+        }
+        false
+    }
+}
+
+/// Filters a given `DocSet` by removing the docs from an exclusion set.
+///
+/// The excluding docsets have no impact on scoring.
+pub struct Exclude<TDocSet, TExclusionSet> {
+    underlying_docset: TDocSet,
+    exclusion_set: TExclusionSet,
+}
+
+impl<TDocSet, TExclusionSet> Exclude<TDocSet, TExclusionSet>
 where
    TDocSet: DocSet,
-    TDocSetExclude: DocSet,
+    TExclusionSet: ExclusionSet,
 {
    /// Creates a new `ExcludeScorer`
    pub fn new(
        mut underlying_docset: TDocSet,
-        mut excluding_docset: TDocSetExclude,
-    ) -> Exclude<TDocSet, TDocSetExclude> {
+        mut exclusion_set: TExclusionSet,
+    ) -> Exclude<TDocSet, TExclusionSet> {
        while underlying_docset.doc() != TERMINATED {
            let target = underlying_docset.doc();
-            if !is_within(&mut excluding_docset, target) {
+            if !exclusion_set.contains(target) {
                break;
            }
            underlying_docset.advance();
        }
        Exclude {
            underlying_docset,
-            excluding_docset,
+            exclusion_set,
        }
    }
 }

-impl<TDocSet, TDocSetExclude> DocSet for Exclude<TDocSet, TDocSetExclude>
+impl<TDocSet, TExclusionSet> DocSet for Exclude<TDocSet, TExclusionSet>
 where
    TDocSet: DocSet,
-    TDocSetExclude: DocSet,
+    TExclusionSet: ExclusionSet,
 {
    fn advance(&mut self) -> DocId {
        loop {
@@ -50,7 +73,7 @@ where
            if candidate == TERMINATED {
                return TERMINATED;
            }
-            if !is_within(&mut self.excluding_docset, candidate) {
+            if !self.exclusion_set.contains(candidate) {
                return candidate;
            }
        }
@@ -61,7 +84,7 @@ where
        if candidate == TERMINATED {
            return TERMINATED;
        }
-        if !is_within(&mut self.excluding_docset, candidate) {
+        if !self.exclusion_set.contains(candidate) {
            return candidate;
        }
        self.advance()
@@ -79,10 +102,10 @@ where
    }
 }

-impl<TScorer, TDocSetExclude> Scorer for Exclude<TScorer, TDocSetExclude>
+impl<TScorer, TExclusionSet> Scorer for Exclude<TScorer, TExclusionSet>
 where
    TScorer: Scorer,
-    TDocSetExclude: DocSet + 'static,
+    TExclusionSet: ExclusionSet + 'static,
 {
    #[inline]
    fn score(&mut self) -> Score {
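
Sketch of the `ExclusionSet` idea from this file, outside tantivy. `SortedDocs` is a hypothetical forward-only cursor standing in for a real `DocSet`, and `seek_found` stands in for `seek_danger(doc) == SeekDangerResult::Found`. The trait is implemented for the concrete type here rather than through a blanket impl. Included docs are assumed to arrive in ascending order.

```rust
pub type DocId = u32;

/// Hypothetical stand-in for a docset: a forward-only cursor over sorted doc ids.
pub struct SortedDocs {
    docs: Vec<DocId>,
    pos: usize,
}

impl SortedDocs {
    pub fn new(mut docs: Vec<DocId>) -> Self {
        docs.sort_unstable();
        docs.dedup();
        Self { docs, pos: 0 }
    }

    /// Moves to the first doc >= `target` and reports whether it equals `target`.
    fn seek_found(&mut self, target: DocId) -> bool {
        while self.pos < self.docs.len() && self.docs[self.pos] < target {
            self.pos += 1;
        }
        self.docs.get(self.pos) == Some(&target)
    }
}

/// A set of documents to exclude, queried with non-decreasing doc ids.
pub trait ExclusionSet {
    fn contains(&mut self, doc: DocId) -> bool;
}

impl ExclusionSet for SortedDocs {
    fn contains(&mut self, doc: DocId) -> bool {
        self.seek_found(doc)
    }
}

impl ExclusionSet for Vec<SortedDocs> {
    fn contains(&mut self, doc: DocId) -> bool {
        // Stops at the first cursor that holds `doc`; no union is built.
        self.iter_mut().any(|docset| docset.seek_found(doc))
    }
}

/// Keeps the docs of `included` (ascending) that are absent from `excluded`.
pub fn exclude<E: ExclusionSet>(included: &[DocId], mut excluded: E) -> Vec<DocId> {
    included
        .iter()
        .copied()
        .filter(|&doc| !excluded.contains(doc))
        .collect()
}
```

The same `exclude` function accepts one cursor or a `Vec` of them, which mirrors how `Exclude::new` is called with a single scorer or with the whole `exclude_scorers` vector in `boolean_weight.rs`.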
--- a/src/query/intersection.rs
+++ b/src/query/intersection.rs
@@ -84,6 +84,14 @@ impl<TDocSet: DocSet> Intersection<TDocSet, TDocSet> {
        docsets.sort_by_key(|docset| docset.cost());
        go_to_first_doc(&mut docsets);
        let left = docsets.remove(0);
+        debug_assert!({
+            let doc = left.doc();
+            if doc == TERMINATED {
+                true
+            } else {
+                docsets.iter().all(|docset| docset.doc() == doc)
+            }
+        });
        let right = docsets.remove(0);
        Intersection {
            left,
@@ -112,30 +120,24 @@ impl<TDocSet: DocSet, TOtherDocSet: DocSet> DocSet for Intersection<TDocSet, TOt
        // Invariant:
        // - candidate is always <= to the next document in the intersection.
        // - candidate strictly increases at every occurence of the loop.
-        let mut candidate = 0;
+        let mut candidate = left.doc() + 1;

        // Termination: candidate strictly increases.
        'outer: while candidate < TERMINATED {
            // As we enter the loop, we should always have candidate < next_doc.

-            // This step always increases candidate.
-            //
-            // TODO: Think about which value would make sense here
-            // It depends on the DocSet implementation, when a seek would outweigh an advance.
-            candidate = if candidate > left.doc().wrapping_add(100) {
-                left.seek(candidate)
-            } else {
-                left.advance()
-            };
+            candidate = left.seek(candidate);

            // Left is positionned on `candidate`.
            debug_assert_eq!(left.doc(), candidate);

            if let SeekDangerResult::SeekLowerBound(seek_lower_bound) = right.seek_danger(candidate)
            {
-                // The max is technically useless but it makes the invariant
-                // easier to proofread.
-                debug_assert!(seek_lower_bound >= candidate);
+                debug_assert!(
+                    seek_lower_bound == TERMINATED || seek_lower_bound > candidate,
+                    "seek_lower_bound {seek_lower_bound} must be greater than candidate \
+                     {candidate}"
+                );
                candidate = seek_lower_bound;
                continue;
            }
@@ -148,7 +150,11 @@ impl<TDocSet: DocSet, TOtherDocSet: DocSet> DocSet for Intersection<TDocSet, TOt
                    other.seek_danger(candidate)
                {
                    // One of the scorer does not match, let's restart at the top of the loop.
-                    debug_assert!(seek_lower_bound >= candidate);
+                    debug_assert!(
+                        seek_lower_bound == TERMINATED || seek_lower_bound > candidate,
+                        "seek_lower_bound {seek_lower_bound} must be greater than candidate \
+                         {candidate}"
+                    );
                    candidate = seek_lower_bound;
                    continue 'outer;
                }
@@ -238,9 +244,12 @@ mod tests {
    use proptest::prelude::*;

    use super::Intersection;
+    use crate::collector::Count;
    use crate::docset::{DocSet, TERMINATED};
    use crate::postings::tests::test_skip_against_unoptimized;
-    use crate::query::VecDocSet;
+    use crate::query::{QueryParser, VecDocSet};
+    use crate::schema::{Schema, TEXT};
+    use crate::Index;

    #[test]
    fn test_intersection() {
@@ -411,4 +420,29 @@ mod tests {
            assert_eq!(intersection.doc(), TERMINATED);
        }
    }
+
+    #[test]
+    fn test_bug_2811_intersection_candidate_should_increase() {
+        let mut schema_builder = Schema::builder();
+        let text_field = schema_builder.add_text_field("text", TEXT);
+        let schema = schema_builder.build();
+
+        let index = Index::create_in_ram(schema);
+        let mut writer = index.writer_for_tests().unwrap();
+        writer
+            .add_document(doc!(text_field=>"hello happy tax"))
+            .unwrap();
+        writer.add_document(doc!(text_field=>"hello")).unwrap();
+        writer.add_document(doc!(text_field=>"hello")).unwrap();
+        writer.add_document(doc!(text_field=>"happy tax")).unwrap();
+
+        writer.commit().unwrap();
+        let query_parser = QueryParser::for_index(&index, Vec::new());
+        let query = query_parser
+            .parse_query(r#"+text:hello +text:"happy tax""#)
+            .unwrap();
+        let searcher = index.reader().unwrap().searcher();
+        let c = searcher.search(&*query, &Count).unwrap();
+        assert_eq!(c, 1);
+    }
 }
--- a/src/query/mod.rs
+++ b/src/query/mod.rs
@@ -43,7 +43,7 @@ pub use self::boost_query::{BoostQuery, BoostWeight};
 pub use self::const_score_query::{ConstScoreQuery, ConstScorer};
 pub use self::disjunction_max_query::DisjunctionMaxQuery;
 pub use self::empty_query::{EmptyQuery, EmptyScorer, EmptyWeight};
-pub use self::exclude::Exclude;
+pub use self::exclude::{Exclude, ExclusionSet};
 pub use self::exist_query::ExistsQuery;
 pub use self::explanation::Explanation;
 #[cfg(test)]
--- a/src/query/phrase_query/phrase_scorer.rs
+++ b/src/query/phrase_query/phrase_scorer.rs
@@ -531,7 +531,12 @@ impl<TPostings: Postings> DocSet for PhraseScorer<TPostings> {
    }

    fn seek_danger(&mut self, target: DocId) -> SeekDangerResult {
-        debug_assert!(target >= self.doc());
+        debug_assert!(
+            target >= self.doc(),
+            "target ({}) should be greater than or equal to doc ({})",
+            target,
+            self.doc()
+        );
        let seek_res = self.intersection_docset.seek_danger(target);
        if seek_res != SeekDangerResult::Found {
            return seek_res;
--- a/src/query/term_query/term_scorer.rs
+++ b/src/query/term_query/term_scorer.rs
@@ -105,6 +105,7 @@ impl DocSet for TermScorer {

    #[inline]
    fn seek(&mut self, target: DocId) -> DocId {
+        debug_assert!(target >= self.doc());
        self.postings.seek(target)
    }

--- a/src/space_usage/mod.rs
+++ b/src/space_usage/mod.rs
@@ -124,7 +124,6 @@ impl SegmentSpaceUsage {
            FieldNorms => PerField(self.fieldnorms().clone()),
            Terms => PerField(self.termdict().clone()),
            SegmentComponent::Store => ComponentSpaceUsage::Store(self.store().clone()),
-            SegmentComponent::TempStore => ComponentSpaceUsage::Store(self.store().clone()),
            Delete => Basic(self.deletes()),
        }
    }
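
Note on the `debug_assert!` changes above: they all guard the same rule, that a doc set is never asked to seek backwards and that an intersection candidate only moves forward. Below is a minimal, dependency-free sketch of that rule. `ToyDocSet` and `count_intersection` are hypothetical names invented for illustration and are not part of the crate; the value chosen for `TERMINATED` is likewise an assumption of this sketch. It is not the crate's `Intersection` implementation.

```rust
/// Sentinel doc id for an exhausted doc set (value is an assumption of this sketch).
const TERMINATED: u32 = i32::MAX as u32;

/// Hypothetical toy doc set over a sorted vector of doc ids.
struct ToyDocSet {
    docs: Vec<u32>,
    cursor: usize,
}

impl ToyDocSet {
    fn new(docs: Vec<u32>) -> Self {
        ToyDocSet { docs, cursor: 0 }
    }

    /// Current doc id, or `TERMINATED` once the set is exhausted.
    fn doc(&self) -> u32 {
        self.docs.get(self.cursor).copied().unwrap_or(TERMINATED)
    }

    /// Moves to the first doc id >= `target`. Callers must never seek backwards.
    fn seek(&mut self, target: u32) -> u32 {
        debug_assert!(
            target >= self.doc(),
            "target ({}) should be greater than or equal to doc ({})",
            target,
            self.doc()
        );
        while self.doc() < target {
            self.cursor += 1;
        }
        self.doc()
    }
}

/// Counts doc ids present in both sets. The candidate strictly increases
/// on every iteration, so neither side is ever asked to seek backwards.
fn count_intersection(left: &mut ToyDocSet, right: &mut ToyDocSet) -> usize {
    let mut count = 0;
    // Start from the larger current doc so the first seek is valid on both sides.
    let mut candidate = left.doc().max(right.doc());
    while candidate != TERMINATED {
        let left_doc = left.seek(candidate);
        if left_doc != candidate {
            // `left_doc` is either TERMINATED or strictly greater than `candidate`.
            candidate = left_doc;
            continue;
        }
        let right_doc = right.seek(candidate);
        if right_doc != candidate {
            candidate = right_doc;
            continue;
        }
        count += 1;
        candidate += 1;
    }
    count
}
```

With posting lists shaped like the regression test's data (`hello` in docs 0, 1, 2 and the phrase `"happy tax"` in docs 0 and 3), `count_intersection(&mut ToyDocSet::new(vec![0, 1, 2]), &mut ToyDocSet::new(vec![0, 3]))` returns 1, matching the count asserted in `test_bug_2811_intersection_candidate_should_increase`.
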
Author	SHA1	Message	Date
cong.xie	bb141abe22	feat(aggregation): add keys() accessor to IntermediateAggregationResults	2026-02-09 15:38:35 -05:00
cong.xie	f1c29ba972	resolve conflcit	2026-02-06 14:23:11 -05:00
cong.xie	ae0554a6a5	feat(aggregation): add public accessors for intermediate aggregation results Add accessor methods to allow external crates to read intermediate aggregation results without accessing pub(crate) fields: - IntermediateAggregationResults: get(), remove() - IntermediateTermBucketResult: entries(), sum_other_doc_count(), doc_count_error_upper_bound() - IntermediateAverage: stats() - IntermediateStats: count(), sum() - IntermediateKey: Display impl for string conversion	2026-02-06 11:12:20 -05:00
cong.xie	0d7abe5d23	feat(aggregation): add public accessors for intermediate aggregation results Add accessor methods to allow external crates to read intermediate aggregation results without accessing pub(crate) fields: - IntermediateAggregationResults: get(), get_mut(), remove() - IntermediateTermBucketResult: entries(), sum_other_doc_count(), doc_count_error_upper_bound() - IntermediateAverage: stats() - IntermediateStats: count(), sum() - IntermediateKey: Display impl for string conversion	2026-02-06 10:28:59 -05:00
PSeitz	98ebbf922d	faster exclude queries (#2825) * faster exclude queries Faster exclude queries with multiple terms. Changes `Exclude` to be able to exclude multiple DocSets, instead of putting the docsets into a union. Use `seek_danger` in `Exclude`. closes #2822 * replace unwrap with match	2026-01-30 17:06:41 +01:00
Paul Masurel	4a89e74597	Fix rfc3339 typos and add Claude Code skills (#2823) Closes #2817	2026-01-30 12:00:28 +01:00
Alex Lazar	4d99e51e50	Bump oneshot to 0.1.13 per dependabot (#2821)	2026-01-30 11:42:01 +01:00
trinity-1686a	9b619998bd	Merge pull request #2816 from evance-br/fix-closing-paren-elastic-range	2026-01-27 17:00:08 +01:00
Evance Soumaoro	765c448945	uncomment commented code when testing	2026-01-27 13:19:41 +00:00
Evance Soumaoro	943594ebaa	uncomment commented code when testing	2026-01-27 13:08:38 +00:00
Evance Soumaoro	df17daae0d	fix closing parenthesis error on elastic range queries for lenient parser	2026-01-27 13:01:14 +00:00
Paul Masurel	0ae94baef5	Remove temp file (#2815) Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>	2026-01-27 09:22:11 +01:00
Paul Masurel	3f448ecf79	Bugfix on intersection. (#2812) The intersection algorithm made it possible for .seek(..) with values lower than the current doc id, breaking the DocSet contract. The fix removes the optimization that caused left.seek(..) to be replaced by a simpler left.advance(..). Simply doing so lead to a performance regression. I therefore integrated that idea within SegmentPostings.seek. We now attempt to check the next doc systematically on seek, PROVIDED the block is already loaded. Closes #2811 Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>	2026-01-27 09:21:09 +01:00