Commit Graph

2483 Commits

Author SHA1 Message Date
Paul Masurel
1b38a9ba62 Added unit test for aggregation 2025-08-28 12:19:22 +02:00
Paul Masurel
39e027667b per field size details (#2679)
* Added per-field size details.

This also does a bunch of refactoring.

merging field metadata does not silently asserts that arguments should be sorted.
merging does not set `stored`.

We do not rely on a hashmap to group fields, but instead rely on the fact that
the term dictionary is sorted.

The inverted level method that exposes field metadata is not exposed
as public anymore.

* CR comment

---------

Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2025-08-13 13:12:22 +02:00
PSeitz-dd
a1d65c3df3 test stable ordering with pagination (#2683) 2025-08-13 15:36:28 +08:00
trinity-1686a
2e4615c2d3 Merge pull request #2678 from Darkheir/feat/query_grammar_space_between_field_and_value
feat: Support spaces between field name and value
2025-08-11 09:57:23 +02:00
trinity-1686a
c301e7b1c4 Merge pull request #2673 from paradedb/stuhood.fix-order-by-dup-string
Fix `TopDocs::order_by_string_fast_field` for duplicates
2025-07-30 18:25:03 +02:00
Darkheir
d4b090124c feat: Support spaces between field name and value 2025-07-23 11:12:13 +02:00
PSeitz-dd
811c68cdb2 fix field_names in top_hits aggregation (#2675) 2025-07-21 12:19:30 +08:00
trinity-1686a
bc1c789897 Merge pull request #2676 from quickwit-oss/trinity.pointard/allow-partial-default-field-success
ignore failure to parse query when other default field suceeded
2025-07-18 14:20:41 +02:00
trinity Pointard
e7c8c331bd ignore failure to parse query when other default field suceeded 2025-07-17 14:47:28 +02:00
Eric Ridge
2f01152a3c adjust Dictionary::sorted_ords_to_term_cb() to allow duplicates 2025-07-16 13:38:43 -07:00
PSeitz
4e84c70387 Fix TopNComputer for reverse order (#2672)
Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
2025-07-16 21:44:04 +08:00
Paul M.
f2c77f06c5 Update fs4 to latest (0.13.1) (#2654)
- One change was needed to handle the `Result<bool>` that now returns from `try_lock_exclusive`

Co-authored-by: Paul M. <prov223@tutanota.com>
2025-07-14 11:26:19 +08:00
PSeitz
945af922d1 clippy (#2661)
* clippy

* use readable version

---------

Co-authored-by: Pascal Seitz <pascal.seitz@datadoghq.com>
2025-07-02 11:25:03 +02:00
PSeitz-dd
295d07e55c fix union performance regression (#2663)
closes https://github.com/quickwit-oss/tantivy/issues/2656
2025-07-01 20:32:25 +02:00
Stu Hood
a2400f4e73 Add string fast field support to TopDocs. (#2642)
* Add string fast field support to `TopDocs`.

* Remove unnecessary generics, and review feedback.

* Use actual/less-ambiguous cities.

* Review feedback
2025-06-20 10:27:14 +02:00
Zhang.Jinrui
436ec6caea fix typo for the comments of search_with_executor() (#2653)
Co-authored-by: Zhang Jinrui <zhangjinrui@microsoft.com>
2025-06-19 09:53:21 +02:00
PSeitz
2b668bd2bf readability improvement on executor (#2615) 2025-04-08 18:28:49 +02:00
Remi Dettai
b681ec9335 Fix compilation stability 2025-04-01 09:33:33 +02:00
trinity Pointard
9426d5be7b fix agg Key PartialEq impl 2025-03-14 14:57:45 +01:00
Paul Masurel
519e5d2ed1 clippy warnings 2025-03-05 11:15:06 +01:00
Paul Masurel
0afabad494 Cargo fmt 2025-03-05 11:07:46 +01:00
Remi Dettai
89b052cd42 Catch panics during merges (#2582)
* Adding panic handler for the rayon merge thread pool

* Return panic message in error

---------

Co-authored-by: Paul Masurel <paul.masurel@datadoghq.com>
2025-03-05 10:36:48 +01:00
SteveLauC
c48c649436 refactor: use std AtomicU64 and remove wrapper (#2585) 2025-02-24 03:56:15 +01:00
Paul Masurel
58c0739953 Merge pull request #2581 from quickwit-oss/merge_dict_column_repro
use usize in bitpacker
2025-02-21 10:53:07 +09:00
Pascal Seitz
e7daf69de9 use usize in bitpacker
use usize in bitpacker to enable larger columns in the columnar store

Godbolt comparison with u32 vs u64 for get access: https://godbolt.org/z/cjf7nenYP

Add a mini-tool to inspect columnar files created by tantivy. (very basic functionality which can be extended later)
2025-02-20 15:39:10 +01:00
trinity Pointard
0368162ef0 make DateHistogramAggregationReq buildable 2025-02-18 11:45:24 +01:00
trinity-1686a
d281ca3e65 Merge pull request #2559 from quickwit-oss/trinity/sstable-partial-automaton
allow warming partially an sstable for an automaton
2025-01-08 16:35:35 +01:00
trinity Pointard
be17daf658 split iterator 2025-01-08 16:24:34 +01:00
trinity Pointard
6ca84a61fa make termdict always clone 2025-01-08 16:19:54 +01:00
trinity Pointard
037d12c9c9 fix deadlocking on automaton warmup 2025-01-06 11:58:58 +01:00
Remi Dettai
71cf19870b Exist queries match subpath fields (#2558)
* Exist queries match subpath fields

* Make subpath check optional

* Add async subpath listing
2025-01-06 10:17:39 +01:00
trinity Pointard
175a529c41 use executor for cpu-heavy sstable decompression for automaton 2025-01-03 19:14:07 +01:00
trinity Pointard
fe0c7c5408 change rangebound style 2025-01-02 11:56:05 +01:00
Harrison Burt
148594f0f9 Improve IndexWriter customisation via builder (#2562)
* Improve `IndexWriter` customisation via builder

* Remove change noise from PR

* Correct documentation

* Resolve comments and add test
2025-01-02 09:43:22 +01:00
trinity Pointard
dfff5f3bcb rename merge_holes_under => merge_holes_under_bytes 2024-12-23 16:17:44 +01:00
trinity-1686a
ebf4d84553 add comment about cpu-intensive operation in async context 2024-12-20 12:23:49 +01:00
trinity-1686a
a1447cc9c2 remove breaking change in sstable public api 2024-12-19 17:30:05 +01:00
trinity-1686a
c39d91f827 Merge pull request #2547 from quickwit-oss/trinity/count-str
add support for counting non integer in aggregation
2024-12-17 15:27:30 +01:00
trinity Pointard
32b6e9711b add tests 2024-12-13 16:06:24 +01:00
trinity-1686a
24c5dc2398 allow warming up automaton 2024-12-10 13:32:12 +01:00
Pierre Barre
6e02c5cb25 Make NUM_MERGE_THREADS configurable (#2535)
* Make `NUM_MERGE_THREADS` configurable

* Remove unused import

* Reword comment src/index/index.rs

Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>

---------

Co-authored-by: PSeitz <PSeitz@users.noreply.github.com>
2024-12-09 16:53:11 +08:00
trinity-1686a
0bac391291 add support for counting non integer in aggregation 2024-11-28 19:52:47 +01:00
Paul Masurel
c35a782747 Updating rustc-hash and clippy fixes (#2532)
* Updating rustc-hash and clippy fixes

* fix terms_aggregation_min_doc_count_special_case

---------

Co-authored-by: Pascal Seitz <pascal.seitz@gmail.com>
2024-11-01 13:46:26 +08:00
PSeitz
21d057059e clippy (#2527)
* clippy

* clippy

* clippy

* clippy

* convert allow to expect and remove unused

* cargo fmt

* cleanup

* export sample

* clippy
2024-10-22 09:26:54 +08:00
PSeitz
dca508b4ca remove read_postings_no_deletes (#2526)
closes #2525
2024-10-22 09:52:43 +09:00
PSeitz
aebae9965d add RegexPhraseQuery (#2516)
* add RegexPhraseQuery

RegexPhraseQuery supports phrase queries with regex. It supports regex
and wildcards. E.g. a query with wildcards:
"b* b* wolf" matches "big bad wolf"
Slop is supported as well:
"b* wolf"~2 matches "big bad wolf"

Regex queries may match a lot of terms where we still need to
keep track which term hit to load the positions.
The phrase query algorithm groups terms by their frequency
together in the union to prefilter groups early.

This PR comes with some new datastructures:

SimpleUnion - A union docset for a list of docsets. It doesn't do any
caching and is therefore well suited for datasets with lots of skipping.
(phrase search, but intersections in general)

LoadedPostings - Like SegmentPostings, but all docs and positions are loaded in
memory. SegmentPostings uses 1840 bytes per instance with its caches,
which is equivalent to 460 docids.
LoadedPostings is used for terms which have less than 100 docs.
LoadedPostings is only used to reduce memory consumption.

BitSetPostingUnion - Creates a `Posting` that uses the bitset for docid
hits and the docsets for positions. The BitSet is the precalculated
union of the docsets
In the RegexPhraseQuery there is a size limit of 512 docsets per PreAggregatedUnion,
before creating a new one.

Renamed Union to BufferedUnionScorer
Added proptests to test different union types.

* cleanup

* use Box instead of Vec

* use RefCell instead of term_freq(&mut)

* remove wildcard mode

* move RefCell to outer

* clippy
2024-10-21 18:29:17 +08:00
PSeitz
2f2db16ec1 store DateTime as nanoseconds in doc store (#2486)
* store DateTime as nanoseconds in doc store

The doc store DateTime was truncated to microseconds previously. This
removes this truncation, while still keeping backwards compatibility.

This is done by adding the trait `ConfigurableBinarySerializable`, which
works like `BinarySerializable`, but with a config that allows de/serialize
as different date time precision currently.

bump version format to 7.
add compat test to check the date time truncation.

* remove configurable binary serialize, add enum for doc store version

* test doc store version ord
2024-10-18 10:50:20 +08:00
Bruce Mitchener
c17e513377 Reduce typo count. (#2510) 2024-10-10 09:55:37 +08:00
Tri
8bd6eb06e6 feat: make SegmentMeta.with_max_doc public (#2499)
* chore: add container

* feat: make max doc editable externally

* chore: expose another method

* chore: remove comments

* remove unused devcontainer

* chore: manually match nightly format

* chore: change weird formating

* revert format change

* fix: format with nightly
2024-09-23 12:39:36 +08:00
PSeitz
55b0b52457 Fix AggregationLimits (#2495)
* change AggregationLimits behavior

This fixes an issue encountered with the current behaviour of
AggregationLimits.
Previously we had AggregationLimits and RessourceLimitGuard, which both
track the memory, but only RessourceLimitGuard released memory when
dropped, while AggregationLimits did not.

This PR changes AggregationLimits to be a guard itself and removes the
RessourceLimitGuard.

* rename AggregationLimits to AggregationLimitsGuard
2024-09-17 14:25:47 +08:00