Commit Graph

26 Commits

Author SHA1 Message Date
PSeitz
927b4432c9 Perf: use term hashmap in fastfield (#2243)
* add shared arena hashmap

* bench fastfield indexing

* use shared arena hashmap in columnar

lower minimum resize in hashtable

* clippy

* add comments
2023-11-09 13:44:02 +01:00
PSeitz
28dd6b6546 collect json paths in indexing (#2231)
* collect json paths in indexing

* remove unsafe iter_mut_keys
2023-11-01 11:25:17 +01:00
PSeitz
19a859d6fd term hashmap remove copy in is_empty, unused unordered_id (#2229) 2023-10-27 05:01:32 +02:00
PSeitz
49448b31c6 chore: Release (#2168)
* chore: Release

* update CHANGELOG
2023-09-01 13:58:58 +02:00
PSeitz
480763db0d track memory arena memory usage (#2148) 2023-08-16 18:19:42 +02:00
Adam Reichold
ebc78127f3 Add BytesFilterCollector to support filtering based on a bytes fast field (#2075)
* Do some Clippy- and Cargo-related boy-scouting.

* Add BytesFilterCollector to support filtering based on a bytes fast field

This is basically a copy of the existing FilterCollector but modified and
specialised to work on a bytes fast field.

* Changed semantics of filter collectors to consider multi-valued fields
2023-06-13 14:19:58 +09:00
PSeitz
e3eacb4388 release tantivy (#2083)
* prerelease

* chore: Release
2023-06-09 10:47:46 +02:00
PSeitz
27f202083c Improve Termmap Indexing Performance +~30% (#2058)
* update benchmark

* Improve Termmap Indexing Performance +~30%

This contains many small changes to improve Termmap performance.
Most notably:
* Specialized byte compare and equality versions, instead of glibc calls.
* ExpUnrolledLinkedList to not contain inline items.

Allow compare hash only via a feature flag compare_hash_only:
64bits should be enough with a good hash function to compare strings by
their hashes instead of comparing the strings. Disabled by default

CreateHashMap/alice/174693
                        time:   [642.23 µs 643.80 µs 645.24 µs]
                        thrpt:  [258.20 MiB/s 258.78 MiB/s 259.41 MiB/s]
                 change:
                        time:   [-14.429% -13.303% -12.348%] (p = 0.00 < 0.05)
                        thrpt:  [+14.088% +15.344% +16.862%]
                        Performance has improved.
CreateHashMap/alice_expull/174693
                        time:   [877.03 µs 880.44 µs 884.67 µs]
                        thrpt:  [188.32 MiB/s 189.22 MiB/s 189.96 MiB/s]
                 change:
                        time:   [-26.460% -26.274% -26.091%] (p = 0.00 < 0.05)
                        thrpt:  [+35.301% +35.637% +35.981%]
                        Performance has improved.
CreateHashMap/numbers_zipf/8000000
                        time:   [9.1198 ms 9.1573 ms 9.1961 ms]
                        thrpt:  [829.64 MiB/s 833.15 MiB/s 836.57 MiB/s]
                 change:
                        time:   [-35.229% -34.828% -34.384%] (p = 0.00 < 0.05)
                        thrpt:  [+52.403% +53.440% +54.390%]
                        Performance has improved.

* clippy

* add bench for ids

* inline(always) to inline whole block with bounds checks

* cleanup
2023-06-08 11:13:52 +02:00
Paul Masurel
47e01b345b Simplified linear probing code (#2066) 2023-06-01 04:58:42 +02:00
dependabot[bot]
4be6f83b0a Update criterion requirement from 0.4 to 0.5 (#2056)
Updates the requirements on [criterion](https://github.com/bheisler/criterion.rs) to permit the latest version.
- [Changelog](https://github.com/bheisler/criterion.rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/bheisler/criterion.rs/compare/0.4.0...0.5.0)

---
updated-dependencies:
- dependency-name: criterion
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-05-24 15:59:51 +09:00
PSeitz
00c5df610c update termmap benchmark (#2040) 2023-05-12 07:35:06 +02:00
PSeitz
d1988be8e9 fix and extend benchmark (#2030)
* add benchmark, add missing inlines

* fix stacker bench

* add wiki benchmark

* move line split out of bench
2023-05-10 13:01:56 +02:00
PSeitz
d3357a8426 fix ArenaHashMap default (#2034)
an empty ArenaHashMap is invalid and causes a panic when combined with `get`
2023-05-10 11:39:47 +02:00
Yuri Astrakhan
74275b76a6 Inline format arguments where makes sense (#2038)
Applied this command to the code, making it a bit shorter and slightly
more readable.

```
cargo +nightly clippy --all-features --benches --tests --workspace --fix -- -A clippy::all -W clippy::uninlined_format_args
cargo +nightly fmt --all
```
2023-05-10 18:03:59 +09:00
tottoto
73452284ae Remove unused crates from dependencies (#2018)
* Remove unused crates from dependencies

* Revert rand to columnar

* Revert criterion to stacker
2023-05-02 12:34:20 +02:00
PSeitz
7b31100208 refactor vint (#2010)
- improve performance of vint
vint serialization shows up in performance profiles during indexing.
It would also make sense to limit the value space to u29 and operate on 4 bytes only.
- remove unused code
- add missing inlines
- fix regex test
2023-04-25 08:49:36 +02:00
PSeitz
e83abbfe4a perf: faster term hash map (#1940)
* add term hashmap benchmark

* refactor arena hashmap

add inlines
remove occupied array and use table_entry.is_empty instead (saves 4 bytes per entry)
reduce saturation threshold from 1/3 to 1/2 to reduce memory
use u32 for UnorderedId (we have the 4billion limit anyways on the Columnar stuff)
fix naming LinearProbing
remove byteorder dependency

memory consumption went down from 2Gb to 1.8GB on indexing wikipedia dataset in tantivy

* Update stacker/src/arena_hashmap.rs

Co-authored-by: Paul Masurel <paul@quickwit.io>

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-04-17 09:07:33 +02:00
Paul Masurel
ed5a3b3172 Bumped murmurhash version 2023-03-03 21:24:32 +09:00
Paul Masurel
097fd6138d Fix clippy comments (#1872) 2023-02-14 23:12:45 +09:00
Paul Masurel
60cc2644d6 Fixing test_fail_on_flush_segment_but_one_worker_remains (#1869)
The new fast field code, based on columnar, had a larger minimum memory
footprint, causing the first docuemnt to trigger a flush of the asegment
in this unit test.

This PR prevents the allocation of a large capacity for the different hashmap tables
using in the columnar writer.

Closes #1859
2023-02-14 16:09:42 +09:00
Paul Masurel
bd5eea9852 Integrated columnar work. 2023-02-09 13:14:31 +01:00
Adrien Guillo
14222a47a3 Fix typo (#1776) 2023-01-11 00:49:13 +09:00
Paul Masurel
2a6d1eaf78 Added missing license. 2022-12-22 12:47:43 +09:00
Paul Masurel
f39165e1e7 Moving FileSlice to tantivy-common (#1729) 2022-12-21 16:35:11 +09:00
PSeitz
f9171a3981 fix clippy (#1725)
* fix clippy

* fix clippy fastfield codecs

* fix clippy bitpacker

* fix clippy common

* fix clippy stacker

* fix clippy sstable

* fmt
2022-12-20 07:30:06 +01:00
Paul Masurel
136a8f4124 Isolating sstable and stacker in independant crates. (#1718)
Both crate will be used in the new (optional + dynamic) fastfield work.
2022-12-13 11:44:17 +09:00