Giovanni Cuccu
4c6c7bbbd1
Merge branch 'main' into issue_1781_extended_stats_boxed_result
2023-11-24 14:36:56 +01:00
Giovanni Cuccu
1dd6182e3a
cosmetic refactor
2023-11-24 11:50:12 +01:00
Giovanni Cuccu
2548b79615
refactored a struct
2023-11-23 14:40:17 +01:00
PSeitz
1a9fc10be9
add fields_metadata to SegmentReader, add columnar docs ( #2222 )
...
* add fields_metadata to SegmentReader, add columnar docs
* use schema to resolve field, add test
* normalize paths
* merge for FieldsMetadata, add fields_metadata on Index
* Update src/core/segment_reader.rs
Co-authored-by: Paul Masurel <paul@quickwit.io >
* merge code paths
* add Hash
* move function outside
---------
Co-authored-by: Paul Masurel <paul@quickwit.io >
2023-11-22 12:29:53 +01:00
PSeitz
07573a7f19
update fst ( #2267 )
...
update fst to 0.5 (deduplicates regex-syntax in the dep tree)
deps cleanup
2023-11-21 16:06:57 +01:00
Giovanni Cuccu
8c7df7ad31
refined version with code formatted
2023-11-20 15:29:18 +01:00
Giovanni Cuccu
dfe46d5f84
interim version
2023-11-20 14:48:13 +01:00
BlackHoleFox
daad2dc151
Take string references instead of owned values building Facet paths ( #2265 )
2023-11-20 09:40:44 +01:00
PSeitz
054f49dc31
support escaped dot, add agg test ( #2250 )
...
add agg test for nested JSON
allow escaping of dot
2023-11-20 03:00:57 +01:00
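Note on the escaped-dot change above (#2250): the commit body only says "allow escaping of dot". The sketch below shows one plausible reading, in which a JSON path is split on `.` unless the dot is preceded by a backslash. The function name `split_json_path` and the choice of backslash as the escape character are assumptions made for illustration; this is not the code from the commit.

```rust
/// Hypothetical helper: split a JSON path into segments on unescaped dots.
/// A backslash directly followed by a dot (`\.`) yields a literal dot
/// inside the current segment (assumed escape syntax).
fn split_json_path(path: &str) -> Vec<String> {
    let mut segments = Vec::new();
    let mut current = String::new();
    let mut chars = path.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            // Escaped dot: consume the dot and keep it as part of the key.
            '\\' if chars.peek() == Some(&'.') => {
                chars.next();
                current.push('.');
            }
            // Unescaped dot: close the current segment.
            '.' => segments.push(std::mem::take(&mut current)),
            other => current.push(other),
        }
    }
    segments.push(current);
    segments
}
```

Under these assumptions, `k8s.node.id` splits into three segments, while `k8s\.node.id` splits into the two segments `k8s.node` and `id`, which is what makes keys containing dots addressable.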
PSeitz
47009ed2d3
remove unused deps ( #2264 )
...
found with cargo machete
remove pprof (doesn't work)
2023-11-20 02:59:59 +01:00
Giovanni Cuccu
e923daa8a6
refactor for using ExtendedStats only when needed
2023-11-19 15:15:53 +01:00
PSeitz
0aae31d7d7
reduce number of allocations ( #2257 )
...
* reduce number of allocations
Explanation makes up around 50% of all allocations (by count, not by performance impact).
It's created during serialization but not called.
- Make Explanation optional in BM25
- Avoid allocations when using Explanation
* use Cow
2023-11-16 13:47:36 +01:00
Paul Masurel
9caab45136
Preparing for 0.21.2 release. ( #2256 )
2023-11-15 10:43:36 +09:00
Chris Tam
6d9a7b7eb0
Derive Debug for SchemaBuilder ( #2254 )
2023-11-15 01:03:44 +01:00
dependabot[bot]
7a2c5804b1
Update itertools requirement from 0.11.0 to 0.12.0 ( #2255 )
...
Updates the requirements on [itertools](https://github.com/rust-itertools/itertools ) to permit the latest version.
- [Changelog](https://github.com/rust-itertools/itertools/blob/master/CHANGELOG.md )
- [Commits](https://github.com/rust-itertools/itertools/compare/v0.11.0...v0.12.0 )
---
updated-dependencies:
- dependency-name: itertools
dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-11-15 01:03:08 +01:00
François Massot
5319977171
Merge pull request #2253 from quickwit-oss/issue/2251-bug-merge-json-object-with-number
...
Fix bug occurring when merging JSON object indexed with positions.
2023-11-14 17:28:29 +01:00
trinity-1686a
828632e8c4
rustfmt
2023-11-14 15:05:16 +01:00
Paul Masurel
6b59ec6fd5
Fix bug occurring when merging JSON object indexed with positions.
...
In a JSON object field, the presence of term frequencies depends on the
field.
Typically, a string indexed with positions will have positions,
while numbers won't.
The presence or absence of term freqs for a given term is unfortunately
encoded in a very passive way.
It is given by the presence of extra information in the skip info, or
the lack of term freqs after decoding vint blocks.
Before, after writing a segment, we would encode the segment correctly
(without any term freq for numbers in a JSON object field).
However, during merge, we would get the default term freq=1 value
(this is the default in the absence of encoded term freqs).
The merger would then proceed and attempt to decode 1 position when
there are in fact none.
This PR requires explicitly telling the posting serializer whether
term frequencies should be serialized for each new term.
Closes #2251
2023-11-14 22:41:48 +09:00
PSeitz
b60d862150
docid deltas while indexing ( #2249 )
...
* docid deltas while indexing
Storing deltas is especially helpful for repetitive data like logs.
In those cases, recording a doc on a term used to cost 4 bytes; it now
costs 1 byte.
HDFS Indexing 1.1GB Total memory consumption:
Before: 760 MB
Now: 590 MB
* use scan for delta decoding
2023-11-13 05:14:27 +01:00
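Note on the docid-delta change above (#2249): the idea described in the commit body can be sketched as follows. Sorted doc ids are stored as differences from the previous id, and each difference is written as a variable-length integer, so small gaps take a single byte. Decoding rebuilds the ids with a running sum via `Iterator::scan`, matching the "use scan for delta decoding" bullet. The function names and the LEB128-style byte layout are assumptions for illustration; the actual in-memory format used by the indexer is not shown in this log.

```rust
/// Hypothetical encoder: delta-encode non-decreasing doc ids and write each
/// delta as a varint (7 payload bits per byte, high bit = "more bytes follow").
fn encode_doc_ids(doc_ids: &[u32]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev = 0u32;
    for &doc in doc_ids {
        // Assumes the input is sorted; otherwise this subtraction underflows.
        let mut delta = doc - prev;
        prev = doc;
        while delta >= 0x80 {
            out.push((delta as u8 & 0x7f) | 0x80);
            delta >>= 7;
        }
        out.push(delta as u8);
    }
    out
}

/// Read the varints back into raw deltas.
fn decode_deltas(bytes: &[u8]) -> Vec<u32> {
    let mut deltas = Vec::new();
    let mut value = 0u32;
    let mut shift = 0;
    for &b in bytes {
        value |= ((b & 0x7f) as u32) << shift;
        if b & 0x80 == 0 {
            deltas.push(value);
            value = 0;
            shift = 0;
        } else {
            shift += 7;
        }
    }
    deltas
}

/// Rebuild absolute doc ids from deltas with a running sum.
fn decode_doc_ids(bytes: &[u8]) -> Vec<u32> {
    decode_deltas(bytes)
        .into_iter()
        .scan(0u32, |prev, delta| {
            *prev += delta;
            Some(*prev)
        })
        .collect()
}
```

With consecutive doc ids such as 1_000_000 through 1_000_003, only the first entry needs several bytes; every following entry has a delta of 1 and fits in one byte, compared with 4 bytes per raw `u32`.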
Giovanni Cuccu
3e83f4a8dc
removed approx dependency
2023-11-11 11:07:00 +01:00
Giovanni Cuccu
061b56b90e
Merge branch 'main' into issue_1787_extended_stats
2023-11-10 16:48:43 +01:00
Giovanni Cuccu
9efb1f7787
version ready for merge
2023-11-10 16:47:18 +01:00
Giovanni Cuccu
86f3b56304
kahan summation and tests with approximate equality
2023-11-10 08:21:26 +01:00
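Note on the Kahan summation commit above: Kahan (compensated) summation keeps a second variable holding the rounding error lost at each addition and feeds it back into the next one. The sketch below is the textbook algorithm, not the code from this commit, and `approx_eq` is a hypothetical stand-in for whatever approximate-equality check the tests use.

```rust
/// Textbook Kahan summation over a slice of f64 values.
fn kahan_sum(values: &[f64]) -> f64 {
    let mut sum = 0.0f64;
    // Running estimate of the low-order bits lost so far.
    let mut compensation = 0.0f64;
    for &v in values {
        let y = v - compensation;
        let t = sum + y;
        // (t - sum) is what was actually added; subtracting y leaves the error.
        compensation = (t - sum) - y;
        sum = t;
    }
    sum
}

/// Hypothetical helper: absolute-tolerance comparison for floating-point tests.
fn approx_eq(a: f64, b: f64, eps: f64) -> bool {
    (a - b).abs() <= eps
}
```

A case where the difference is visible: adding ten values of `1e-16` to `1.0` one at a time leaves a plain running sum at exactly `1.0`, because each addend is below half a unit in the last place, whereas the compensated sum ends up above `1.0`.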
PSeitz
4837c7811a
add missing inlines ( #2245 )
2023-11-10 08:00:42 +01:00
PSeitz
5a2397d57e
add sstable ord_to_term benchmark ( #2242 )
2023-11-10 07:27:48 +01:00
PSeitz
927b4432c9
Perf: use term hashmap in fastfield ( #2243 )
...
* add shared arena hashmap
* bench fastfield indexing
* use shared arena hashmap in columnar
lower minimum resize in hashtable
* clippy
* add comments
2023-11-09 13:44:02 +01:00
trinity-1686a
7a0064db1f
bump index version ( #2237 )
...
* bump index version
and add constant for lowest supported version
* use range instead of handcoded bounds
2023-11-06 19:02:37 +01:00
PSeitz
2e7327205d
fix coverage run ( #2232 )
...
coverage run uses the compare_hash_only feature, which is not compatible
with the test_hashmap_size test
2023-11-06 11:18:38 +00:00
Paul Masurel
7bc5bf78e2
Fixing functional tests. ( #2239 )
2023-11-05 18:18:39 +09:00
Giovanni Cuccu
bd7d7e3b8c
first test with extended_stats
2023-11-04 18:20:46 +01:00
Giovanni Cuccu
db91df9f70
Created struct for request and response
2023-11-04 16:02:10 +01:00
Giovanni Cuccu
86bdb8b95c
using IntermediateExtendStats instead of IntermediateStats with all tests passing
2023-11-04 12:34:04 +01:00
Giovanni Cuccu
8d4f2db9d9
first version of extended stats along with its tests
2023-11-04 11:32:51 +01:00
giovannicuccu
ef603c8c7e
rename ReloadPolicy onCommit to onCommitWithDelay ( #2235 )
...
* rename ReloadPolicy onCommit to onCommitWithDelay
* fix format issues
---------
Co-authored-by: Giovanni Cuccu <gcuccu@imolainformatica.it >
2023-11-03 12:22:10 +01:00
PSeitz
28dd6b6546
collect json paths in indexing ( #2231 )
...
* collect json paths in indexing
* remove unsafe iter_mut_keys
2023-11-01 11:25:17 +01:00
trinity-1686a
1dda2bb537
handle * inside term in query parser ( #2228 )
2023-10-27 08:57:02 +02:00
PSeitz
bf6544cf28
fix mmap::Advice reexport ( #2230 )
2023-10-27 14:09:25 +09:00
PSeitz
ccecf946f7
tantivy 0.21.1 ( #2227 )
2023-10-27 05:01:44 +02:00
PSeitz
19a859d6fd
term hashmap remove copy in is_empty, unused unordered_id ( #2229 )
2023-10-27 05:01:32 +02:00
PSeitz
83af14caa4
Fix range query ( #2226 )
...
Fix range query end check in advance
Rename vars to reduce ambiguity
add tests
Fixes #2225
2023-10-25 09:17:31 +02:00
PSeitz
4feeb2323d
fix clippy ( #2223 )
2023-10-24 10:05:22 +02:00
PSeitz
07bf66a197
json path writer ( #2224 )
...
* refactor logic to JsonPathWriter
* use in encode_column_name
* add inlines
* move unsafe block
2023-10-24 09:45:50 +02:00
trinity-1686a
0d4589219b
encode some part of posting list as -1 instead of direct values ( #2185 )
...
* add support for delta-1 encoding posting list
* encode term frequency minus one
* don't emit tf for json integer terms
* make skipreader not pub(crate) mutable
2023-10-20 16:58:26 +02:00
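Note on the "minus one" encoding above (#2185): a term frequency in a posting list is always at least 1, so storing `tf - 1` loses nothing and makes the most common value 0. The sketch below illustrates why that can help a bit-packed block: a block where every term frequency is 1 needs zero bits per value after the shift. All function names are hypothetical, and the bit-width calculation is a simplified stand-in for the real block encoder.

```rust
/// Hypothetical encoder: store each term frequency minus one.
fn encode_term_freqs(term_freqs: &[u32]) -> Vec<u32> {
    term_freqs
        .iter()
        .map(|&tf| {
            assert!(tf >= 1, "a term frequency of 0 cannot occur in a posting");
            tf - 1
        })
        .collect()
}

/// Inverse of `encode_term_freqs`.
fn decode_term_freqs(stored: &[u32]) -> Vec<u32> {
    stored.iter().map(|&v| v + 1).collect()
}

/// Bits per value needed to bit-pack a block (0 if every value is 0).
fn bits_needed(values: &[u32]) -> u32 {
    values
        .iter()
        .copied()
        .max()
        .map_or(0, |max| 32 - max.leading_zeros())
}
```

The same trick applies to doc id deltas within a posting list, since consecutive doc ids for one term differ by at least 1.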
PSeitz
c2b0469180
improve docs, rework exports ( #2220 )
...
* rework exports
move snippet and advice
make indexer pub, remove indexer reexports
* add deprecation warning
* add architecture overview
2023-10-18 09:22:24 +02:00
PSeitz
7e1980b218
run coverage only after merge ( #2212 )
...
* run coverage only after merge
Coverage is quite a slow step in CI. It can be run only after merging.
* Apply suggestions from code review
Co-authored-by: Paul Masurel <paul@quickwit.io >
---------
Co-authored-by: Paul Masurel <paul@quickwit.io >
2023-10-18 07:19:36 +02:00
PSeitz
ecb9a89a9f
add compat mode for JSON ( #2219 )
2023-10-17 10:00:55 +02:00
PSeitz
5e06e504e6
split into ReferenceValueLeaf ( #2217 )
2023-10-16 16:31:30 +02:00
PSeitz
182f58cea6
remove Document: DocumentDeserialize dependency ( #2211 )
...
* remove Document: DocumentDeserialize dependency
The dependency requires users to implement an API they may not use.
* remove unnecessary Document bounds
2023-10-13 07:59:54 +02:00
dependabot[bot]
337ffadefd
Update lru requirement from 0.11.0 to 0.12.0 ( #2208 )
...
Updates the requirements on [lru](https://github.com/jeromefroe/lru-rs ) to permit the latest version.
- [Changelog](https://github.com/jeromefroe/lru-rs/blob/master/CHANGELOG.md )
- [Commits](https://github.com/jeromefroe/lru-rs/compare/0.11.0...0.12.0 )
---
updated-dependencies:
- dependency-name: lru
dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-12 12:09:56 +02:00
dependabot[bot]
22aa4daf19
Update zstd requirement from 0.12 to 0.13 ( #2214 )
...
Updates the requirements on [zstd](https://github.com/gyscos/zstd-rs ) to permit the latest version.
- [Release notes](https://github.com/gyscos/zstd-rs/releases )
- [Commits](https://github.com/gyscos/zstd-rs/compare/v0.12.0...v0.13.0 )
---
updated-dependencies:
- dependency-name: zstd
dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-12 04:24:44 +02:00