Commit Graph

3124 Commits

Author SHA1 Message Date
Adam Reichold
19a773da47 Allow cheaply cloning a StoreReader to enable user control over block cache usage. 2023-12-11 10:28:50 +01:00
PSeitz
bff7c58497 improve indexing benchmark (#2275) 2023-12-11 09:04:42 +01:00
trinity-1686a
9ebc5ed053 use fst for sstable index (#2268)
* read path for new fst based index

* implement BlockAddrStoreWriter

* extract slop/derivation computation

* use better linear approximator and allow negative correction to approximator

* document format and reorder some fields

* optimize single block sstable size

* plug backward compat
2023-12-04 15:13:15 +01:00
PSeitz
0b56c88e69 Revert "Preparing for 0.21.2 release." (#2258)
* Revert "Preparing for 0.21.2 release. (#2256)"

This reverts commit 9caab45136.

* bump version to 0.21.1

* set version to 0.22.0-dev
2023-12-01 13:46:12 +01:00
PSeitz
24841f0b2a update bitpacker dep (#2269) 2023-12-01 13:45:52 +01:00
PSeitz
1a9fc10be9 add fields_metadata to SegmentReader, add columnar docs (#2222)
* add fields_metadata to SegmentReader, add columnar docs

* use schema to resolve field, add test

* normalize paths

* merge for FieldsMetadata, add fields_metadata on Index

* Update src/core/segment_reader.rs

Co-authored-by: Paul Masurel <paul@quickwit.io>

* merge code paths

* add Hash

* move function oustide

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-11-22 12:29:53 +01:00
PSeitz
07573a7f19 update fst (#2267)
update fst to 0.5 (deduplicates regex-syntax in the dep tree)
deps cleanup
2023-11-21 16:06:57 +01:00
BlackHoleFox
daad2dc151 Take string references instead of owned values building Facet paths (#2265) 2023-11-20 09:40:44 +01:00
PSeitz
054f49dc31 support escaped dot, add agg test (#2250)
add agg test for nested JSON
allow escaping of dot
2023-11-20 03:00:57 +01:00
PSeitz
47009ed2d3 remove unused deps (#2264)
found with cargo machete
remove pprof (doesn't work)
2023-11-20 02:59:59 +01:00
PSeitz
0aae31d7d7 reduce number of allocations (#2257)
* reduce number of allocations

Explanation makes up around 50% of all allocations (numbers not perf).
It's created during serialization but not called.

- Make Explanation optional in BM25
- Avoid allocations when using Explanation

* use Cow
2023-11-16 13:47:36 +01:00
Paul Masurel
9caab45136 Preparing for 0.21.2 release. (#2256) 2023-11-15 10:43:36 +09:00
Chris Tam
6d9a7b7eb0 Derive Debug for SchemaBuilder (#2254) 2023-11-15 01:03:44 +01:00
dependabot[bot]
7a2c5804b1 Update itertools requirement from 0.11.0 to 0.12.0 (#2255)
Updates the requirements on [itertools](https://github.com/rust-itertools/itertools) to permit the latest version.
- [Changelog](https://github.com/rust-itertools/itertools/blob/master/CHANGELOG.md)
- [Commits](https://github.com/rust-itertools/itertools/compare/v0.11.0...v0.12.0)

---
updated-dependencies:
- dependency-name: itertools
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-11-15 01:03:08 +01:00
François Massot
5319977171 Merge pull request #2253 from quickwit-oss/issue/2251-bug-merge-json-object-with-number
Fix bug occuring when merging JSON object indexed with positions.
2023-11-14 17:28:29 +01:00
trinity-1686a
828632e8c4 rustfmt 2023-11-14 15:05:16 +01:00
Paul Masurel
6b59ec6fd5 Fix bug occuring when merging JSON object indexed with positions.
In JSON Object field the presence of term frequencies depend on the
field.
Typically, a string with postiions indexed will have positions
while numbers won't.

The presence or absence of term freqs for a given term is unfortunately
encoded in a very passive way.

It is given by the presence of extra information in the skip info, or
the lack of term freqs after decoding vint blocks.

Before, after writing a segment, we would encode the segment correctly
(without any term freq for number in json object field).
However during merge, we would get the default term freq=1 value.
(this is default in the absence of encoded term freqs)

The merger would then proceed and attempt to decode 1 position when
there are in fact none.

This PR requires to explictly tell the posting serialize whether
term frequencies should be serialized for each new term.

Closes #2251
2023-11-14 22:41:48 +09:00
PSeitz
b60d862150 docid deltas while indexing (#2249)
* docid deltas while indexing

storing deltas is especially helpful for repetitive data like logs.
In those cases, recording a doc on a term costed 4 bytes instead of 1
byte now.

HDFS Indexing 1.1GB Total memory consumption:
Before:  760 MB
Now:     590 MB

* use scan for delta decoding
2023-11-13 05:14:27 +01:00
PSeitz
4837c7811a add missing inlines (#2245) 2023-11-10 08:00:42 +01:00
PSeitz
5a2397d57e add sstable ord_to_term benchmark (#2242) 2023-11-10 07:27:48 +01:00
PSeitz
927b4432c9 Perf: use term hashmap in fastfield (#2243)
* add shared arena hashmap

* bench fastfield indexing

* use shared arena hashmap in columnar

lower minimum resize in hashtable

* clippy

* add comments
2023-11-09 13:44:02 +01:00
trinity-1686a
7a0064db1f bump index version (#2237)
* bump index version

and add constant for lowest supported version

* use range instead of handcoded bounds
2023-11-06 19:02:37 +01:00
PSeitz
2e7327205d fix coverage run (#2232)
coverage run uses the compare_hash_only feature which is not compativle
with the test_hashmap_size test
2023-11-06 11:18:38 +00:00
Paul Masurel
7bc5bf78e2 Fixing functional tests. (#2239) 2023-11-05 18:18:39 +09:00
giovannicuccu
ef603c8c7e rename ReloadPolicy onCommit to onCommitWithDelay (#2235)
* rename ReloadPolicy onCommit to onCommitWithDelay

* fix format issues

---------

Co-authored-by: Giovanni Cuccu <gcuccu@imolainformatica.it>
2023-11-03 12:22:10 +01:00
PSeitz
28dd6b6546 collect json paths in indexing (#2231)
* collect json paths in indexing

* remove unsafe iter_mut_keys
2023-11-01 11:25:17 +01:00
trinity-1686a
1dda2bb537 handle * inside term in query parser (#2228) 2023-10-27 08:57:02 +02:00
PSeitz
bf6544cf28 fix mmap::Advice reexport (#2230) 2023-10-27 14:09:25 +09:00
PSeitz
ccecf946f7 tantivy 0.21.1 (#2227) 2023-10-27 05:01:44 +02:00
PSeitz
19a859d6fd term hashmap remove copy in is_empty, unused unordered_id (#2229) 2023-10-27 05:01:32 +02:00
PSeitz
83af14caa4 Fix range query (#2226)
Fix range query end check in advance
Rename vars to reduce ambiguity
add tests

Fixes #2225
2023-10-25 09:17:31 +02:00
PSeitz
4feeb2323d fix clippy (#2223) 2023-10-24 10:05:22 +02:00
PSeitz
07bf66a197 json path writer (#2224)
* refactor logic to JsonPathWriter

* use in encode_column_name

* add inlines

* move unsafe block
2023-10-24 09:45:50 +02:00
trinity-1686a
0d4589219b encode some part of posting list as -1 instead of direct values (#2185)
* add support for delta-1 encoding posting list

* encode term frequency minus one

* don't emit tf for json integer terms

* make skipreader not pub(crate) mutable
2023-10-20 16:58:26 +02:00
PSeitz
c2b0469180 improve docs, rework exports (#2220)
* rework exports

move snippet and advice
make indexer pub, remove indexer reexports

* add deprecation warning

* add architecture overview
2023-10-18 09:22:24 +02:00
PSeitz
7e1980b218 run coverage only after merge (#2212)
* run coverage only after merge

coverage is a quite slow step in CI. It can be run only after merging

* Apply suggestions from code review

Co-authored-by: Paul Masurel <paul@quickwit.io>

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-10-18 07:19:36 +02:00
PSeitz
ecb9a89a9f add compat mode for JSON (#2219) 2023-10-17 10:00:55 +02:00
PSeitz
5e06e504e6 split into ReferenceValueLeaf (#2217) 2023-10-16 16:31:30 +02:00
PSeitz
182f58cea6 remove Document: DocumentDeserialize dependency (#2211)
* remove Document: DocumentDeserialize dependency

The dependency requires users to implement an API they may not use.

* remove unnecessary Document bounds
2023-10-13 07:59:54 +02:00
dependabot[bot]
337ffadefd Update lru requirement from 0.11.0 to 0.12.0 (#2208)
Updates the requirements on [lru](https://github.com/jeromefroe/lru-rs) to permit the latest version.
- [Changelog](https://github.com/jeromefroe/lru-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/jeromefroe/lru-rs/compare/0.11.0...0.12.0)

---
updated-dependencies:
- dependency-name: lru
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-12 12:09:56 +02:00
dependabot[bot]
22aa4daf19 Update zstd requirement from 0.12 to 0.13 (#2214)
Updates the requirements on [zstd](https://github.com/gyscos/zstd-rs) to permit the latest version.
- [Release notes](https://github.com/gyscos/zstd-rs/releases)
- [Commits](https://github.com/gyscos/zstd-rs/compare/v0.12.0...v0.13.0)

---
updated-dependencies:
- dependency-name: zstd
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-12 04:24:44 +02:00
PSeitz
493f9b2f2a Read list of JSON fields encoded in dictionary (#2184)
* Read list of JSON fields encoded in dictionary

add method to get list of fields on InvertedIndexReader

* add field type
2023-10-09 12:06:22 +02:00
PSeitz
e246e5765d replace ReferenceValue with Self in Value (#2210) 2023-10-06 08:22:15 +02:00
PSeitz
6097235eff fix numeric order, refactor Document (#2209)
fix numeric order to prefer i64
rename and move Document stuff
2023-10-05 16:39:56 +02:00
PSeitz
b700c42246 add AsRef, expose object and array iter on Value (#2207)
add AsRef
expose object and array iter
add to_json on Document
2023-10-05 03:55:35 +02:00
PSeitz
5b1bf1a993 replace Field with field name (#2196) 2023-10-04 06:21:40 +02:00
PSeitz
041d4fced7 move to_named_doc to Document trait (#2205) 2023-10-04 06:03:07 +02:00
dependabot[bot]
166fc15239 Update memmap2 requirement from 0.7.1 to 0.9.0 (#2204)
Updates the requirements on [memmap2](https://github.com/RazrFalcon/memmap2-rs) to permit the latest version.
- [Changelog](https://github.com/RazrFalcon/memmap2-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/RazrFalcon/memmap2-rs/compare/v0.7.1...v0.9.0)

---
updated-dependencies:
- dependency-name: memmap2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-04 05:00:46 +02:00
PSeitz
514a6e7fef fix bench compile, fix Document reexport (#2203) 2023-10-03 17:28:36 +02:00
dependabot[bot]
82d9127191 Update fs4 requirement from 0.6.3 to 0.7.0 (#2199)
Updates the requirements on [fs4](https://github.com/al8n/fs4-rs) to permit the latest version.
- [Release notes](https://github.com/al8n/fs4-rs/releases)
- [Commits](https://github.com/al8n/fs4-rs/commits/0.7.0)

---
updated-dependencies:
- dependency-name: fs4
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-03 04:43:09 +02:00