Commit Graph

3465 Commits

Author SHA1 Message Date
Pascal Seitz
0bdec77410 pub method on Term 2026-02-24 13:31:51 +01:00
Pascal Seitz
1a1c29c785 allow Searcher to be constructed without index 2026-02-17 17:56:33 +01:00
Pascal Seitz
8a16afa2f1 add to_json->Value method 2026-02-17 13:21:23 +01:00
Pascal Seitz
e841cebba4 convert StoreReader to trait
this will remove the DocumentDeserialize (maybe added later in a
different form)
2026-02-16 17:33:49 +01:00
Pascal Seitz
05f255b757 add async methods for quickwit 2026-02-16 10:32:32 +01:00
Pascal Seitz
e6318e1591 add comments, remove fieldnorms 2026-02-12 14:13:33 +01:00
Pascal Seitz
70bb97231b remove fieldnorms_readers 2026-02-11 19:40:45 +01:00
Paul Masurel
6038455761 First stab at tantivy's codec
Convert SegmentReader, InvertedIndexReader and postinglists to traits.
Add special functions to pushdown certain performance methods to keep
them strictly typed.

We rely on a ObjectSafeCodec contraption to avoid the proliferation of generics.
That object's point is to make sure we can build TermScorer with a concrete
codec specific type before reboxing it. (same thing for PhraseScorer).

fix performance regression: fix incorrect scorer cast for buffered union
bock wand
2026-02-11 15:11:29 +01:00
PSeitz
57fe659fff make serializer pub (#2835)
some changes on the posting list serializer to make it usable in
other contexts.

Improve errors

Signed-off-by: Pascal Seitz <pascal.seitz@gmail.com>
2026-02-11 14:37:42 +01:00
trinity-1686a
5562ce6037 Merge pull request #2818 from Darkheir/fix/query_grammar_regex_between_parentheses 2026-02-11 11:39:58 +01:00
Metin Dumandag
09b6ececa7 Export fields of the PercentileValuesVecEntry (#2833)
Otherwise, there is no way to access these fields when not using the
json serialized form of the aggregation results.

This simple data struct is part of the public api,
so its fields should be accessible as well.
2026-02-11 11:31:07 +01:00
Moe
8018016e46 feat: add fast field support for Bytes type (#100) (#2830)
## What

Enable range queries and TopN sorting on `Bytes` fast fields, bringing them to parity with `Str` fields.

## Why

`BytesColumn` uses the same dictionary encoding as `StrColumn` internally, but range queries and TopN sorting were explicitly disabled for `Bytes`. This prevented use cases like storing lexicographically sortable binary data (e.g., arbitrary-precision decimals) that need efficient range filtering.

## How

1. **Enable range queries for Bytes** - Changed `is_type_valid_for_fastfield_range_query()` to return `true` for `Type::Bytes`
2. **Add BytesColumn handling in scorer** - Added a branch in `FastFieldRangeWeight::scorer()` to handle bytes fields using dictionary ordinal lookup (mirrors the existing `StrColumn` logic)
3. **Add SortByBytes** - New sort key computer for TopN queries on bytes columns

## Tests

- `test_bytes_field_ff_range_query` - Tests inclusive/exclusive bounds and unbounded ranges
- `test_sort_by_bytes_asc` / `test_sort_by_bytes_desc` - Tests lexicographic ordering in both directions
2026-02-11 11:26:18 +01:00
trinity-1686a
6bf185dc3f Merge pull request #2829 from quickwit-oss/cong.xie/add-intermediate-accessors 2026-02-10 17:07:24 +01:00
cong.xie
bb141abe22 feat(aggregation): add keys() accessor to IntermediateAggregationResults 2026-02-09 15:38:35 -05:00
cong.xie
f1c29ba972 resolve conflcit 2026-02-06 14:23:11 -05:00
cong.xie
ae0554a6a5 feat(aggregation): add public accessors for intermediate aggregation results
Add accessor methods to allow external crates to read intermediate
aggregation results without accessing pub(crate) fields:

- IntermediateAggregationResults: get(), remove()
- IntermediateTermBucketResult: entries(), sum_other_doc_count(), doc_count_error_upper_bound()
- IntermediateAverage: stats()
- IntermediateStats: count(), sum()
- IntermediateKey: Display impl for string conversion
2026-02-06 11:12:20 -05:00
cong.xie
0d7abe5d23 feat(aggregation): add public accessors for intermediate aggregation results
Add accessor methods to allow external crates to read intermediate
aggregation results without accessing pub(crate) fields:

- IntermediateAggregationResults: get(), get_mut(), remove()
- IntermediateTermBucketResult: entries(), sum_other_doc_count(), doc_count_error_upper_bound()
- IntermediateAverage: stats()
- IntermediateStats: count(), sum()
- IntermediateKey: Display impl for string conversion
2026-02-06 10:28:59 -05:00
PSeitz
28db952131 Add regex search and merge segments benchmark (#2826)
* add merge_segments benchmark

* add regex search bench
2026-02-02 17:28:02 +01:00
PSeitz
98ebbf922d faster exclude queries (#2825)
* faster exclude queries

Faster exclude queries with multiple terms.

Changes `Exclude` to be able to exclude multiple DocSets, instead of
putting the docsets into a union.
Use `seek_danger` in `Exclude`.

closes #2822

* replace unwrap with match
2026-01-30 17:06:41 +01:00
Paul Masurel
4a89e74597 Fix rfc3339 typos and add Claude Code skills (#2823)
Closes #2817
2026-01-30 12:00:28 +01:00
Alex Lazar
4d99e51e50 Bump oneshot to 0.1.13 per dependabot (#2821) 2026-01-30 11:42:01 +01:00
Darkheir
a55e4069e4 feat(query-grammar): Apply PR review suggestions
Signed-off-by: Darkheir <raphael.cohen@sekoia.io>
2026-01-28 14:13:55 +01:00
Darkheir
1fd30c62be fix(query-grammar): Fix regexes between parentheses
Signed-off-by: Darkheir <raphael.cohen@sekoia.io>
2026-01-28 10:37:51 +01:00
trinity-1686a
9b619998bd Merge pull request #2816 from evance-br/fix-closing-paren-elastic-range 2026-01-27 17:00:08 +01:00
Evance Soumaoro
765c448945 uncomment commented code when testing 2026-01-27 13:19:41 +00:00
Evance Soumaoro
943594ebaa uncomment commented code when testing 2026-01-27 13:08:38 +00:00
Evance Soumaoro
df17daae0d fix closing parenthesis error on elastic range queries for lenient parser 2026-01-27 13:01:14 +00:00
Paul Masurel
0ae94baef5 Remove temp file (#2815)
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2026-01-27 09:22:11 +01:00
Paul Masurel
3f448ecf79 Bugfix on intersection. (#2812)
The intersection algorithm made it possible for .seek(..) with values
lower than the current doc id, breaking the DocSet contract.

The fix removes the optimization that caused left.seek(..) to be replaced
by a simpler left.advance(..).

Simply doing so lead to a performance regression.
I therefore integrated that idea within SegmentPostings.seek.

We now attempt to check the next doc systematically on seek,
PROVIDED the block is already loaded.

Closes #2811

Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2026-01-27 09:21:09 +01:00
Paul Masurel
b86caeefe2 Major bugfix in intersection
A bug was added with the `seek_into_the_danger_zone()` optimization

(Spotted and fixed by Stu)

The contract says seek_into_the_danger_zone returns true if do is part of the docset.

The blanket implementation goes like this.

```
let current_doc = self.doc();
if current_doc < target {
     self.seek(target);
}
self.doc() == target
```

So it will return true if target is TERMINATED, where really TERMINATED does not belong to the docset.


The fix tries to clarify the contracts and fixes the intersection algorithm.
We observe a small but all over the board improvement in intersection performance.

---------

Co-authored-by: Stu Hood <stuhood@gmail.com>
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2026-01-23 18:44:10 +01:00
ChangRui-Ryan
abf1e64f4d add benchmark for string search and get (#2795) 2026-01-19 11:50:41 +01:00
trinity-1686a
12977bc7c4 upgrade some dependancies (#2802)
including rand, which had a few breaking changes
2026-01-14 10:19:09 +01:00
trinity-1686a
0c94eb94c3 Merge pull request #2799 from jollygreenlaser/lru 2026-01-13 22:47:35 +01:00
Paul Masurel
c92e831dde Minor refactoring in PostingsSerializer (#2801)
Removes the Write generics argument in PostingsSerializer.
This removes useless generic.
Prepares the path for codecs.
Removes one useless CountingWrite layer.
etc.

Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2026-01-12 13:53:43 +01:00
Alex Lazar
947c0d5f40 Bump lru to 0.16.3 per dependabot 2026-01-09 23:25:51 -08:00
Paul Masurel
d904630e6a Bumped bitpacking version (#2797)
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2026-01-08 15:50:22 +01:00
PSeitz-dd
65b5a1a306 one collector per agg request instead per bucket (#2759)
* improve bench

* add more tests for new collection type

* one collector per agg request instead per bucket

In this refactoring a collector knows in which bucket of the parent
their data is in. This allows to convert the previous approach of one
collector per bucket to one collector per request.

low card bucket optimization

* reduce dynamic dispatch, faster term agg

* use radix map, fix prepare_max_bucket

use paged term map in term agg
use special no sub agg term map impl

* specialize columntype in stats

* remove stacktrace bloat, use &mut helper

increase cache to 2048

* cleanup

remove clone
move data in term req, single doc opt for stats

* add comment

* share column block accessor

* simplify fetch block in column_block_accessor

* split subaggcache into two trait impls

* move partitions to heap

* fix name, add comment

---------

Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>
2026-01-06 11:50:55 +01:00
ChangRui-Ryan
db2ecc6057 fix Column.first method parameter type (#2792) 2026-01-05 10:03:01 +01:00
Paul Masurel
77505c3d03 Making stemming optional. (#2791)
Fixed code and CI to run on no default features.

Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2026-01-02 12:40:42 +01:00
PSeitz
735c588f4f fix union performance regression (#2790)
* add inlines

* fix union performance regression

Remove unwrap from hotpath generates better assembly.

closes #2788
2026-01-02 12:06:51 +01:00
PSeitz
242a1531bf fix flaky test (#2784)
Signed-off-by: Pascal Seitz <pascal.seitz@gmail.com>
2026-01-02 11:30:51 +01:00
trinity-1686a
6443b63177 document 1bit hole and some queries supporting running with just fastfield (#2779)
* add small doc on some queries using fast field when not indexed

* document 1 unused bit in skiplist
2026-01-02 10:32:37 +01:00
Stu Hood
4987495ee4 Add an erased SortKeyComputer to sort on types which are not known until runtime (#2770)
* Remove PartialOrd bound on compared values.

* Fix declared `SortKey` type of `impl<..> SortKeyComputer for (HeadSortKeyComputer, TailSortKeyComputer)`

* Add a SortByOwnedValue implementation to provide a type-erased column.

* Add support for comparing mismatched `OwnedValue` types.

* Support JSON columns.

* Refer to https://github.com/quickwit-oss/tantivy/issues/2776

* Rename to `SortByErasedType`.

* Comment on transitivity.

Co-authored-by: Paul Masurel <paul@quickwit.io>

* Fix clippy warnings in new code.

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2026-01-02 10:28:47 +01:00
Paul Masurel
b11605f045 Addressing clippy comments (#2789)
Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2025-12-31 18:02:00 +01:00
ChangRui-Ryan
75d7989cc6 add benchmark for boolean query with range sub query (#2787) 2025-12-31 12:00:53 +01:00
PSeitz
923f0508f2 seek_exact + cost based intersection (#2538)
* seek_exact + cost based intersection

Adds `seek_exact` and `cost` to `DocSet` for a more efficient intersection.
Unlike `seek`, `seek_exact` does not require the DocSet to advance to the next hit, if the target does not exist.

`cost` allows to address the different DocSet types and their cost
model and is used to determine the DocSet that drives the intersection.
E.g. fast field range queries may do a full scan. Phrase queries load the positions to check if a we have a hit.
They both have a higher cost than their size_hint would suggest.

Improves `size_hint` estimation for intersection and union, by having a
estimation based on random distribution with a co-location factor.

Refactor range query benchmark.

Closes #2531

*Future Work*

Implement `seek_exact` for BufferedUnionScorer and RangeDocSet (fast field range queries)
Evaluate replacing `seek` with `seek_exact` to reduce code complexity

* Apply suggestions from code review

Co-authored-by: Paul Masurel <paul@quickwit.io>

* add API contract verfication

* impl seek_exact on union

* rename seek_exact

* add mixed AND OR test, fix buffered_union

* Add a proptest of BooleanQuery. (#2690)

* fix build

* Increase the document count.

* fix merge conflict

* fix debug assert

* Fix compilation errors after rebase

- Remove duplicate proptest_boolean_query module
- Remove duplicate cost() method implementations
- Fix TopDocs API usage (add .order_by_score())
- Remove duplicate imports
- Remove unused variable assignments

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
Co-authored-by: Stu Hood <stuhood@gmail.com>
2025-12-30 14:43:25 +01:00
ChangRui-Ryan
e0b62e00ac optimize RangeDocSet for non-overlapping query ranges (#2783) 2025-12-29 16:55:28 +01:00
Stu Hood
ce97beb86f Add support for natural-order-with-none-highest in TopDocs::order_by (#2780)
* Add `ComparatorEnum::NaturalNoneHigher`.

* Fix comments.
2025-12-23 09:22:20 +01:00
Stu Hood
c0f21a45ae Use a strict comparison in TopNComputer (#2777)
* Remove `(Partial)Ord` from `ComparableDoc`, and unify comparison between `TopNComputer` and `Comparator`.

* Doc cleanups.

* Require Ord for `ComparableDoc`.

* Semantics are actually _ascending_ DocId order.

* Adjust docs again for ascending DocId order.

* minor change

---------

Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2025-12-18 12:13:23 +01:00
Moe
73657dff77 fix: fixed integer overflow in ExpUnrolledLinkedList for large datasets (#2735)
* Fixed the overflow issue.

* Fixed lint issues.

* Applied PR fixes.

* Fixed a lint issue.
2025-12-16 22:57:12 +01:00