Commit Graph

2943 Commits

Author SHA1 Message Date
RT_Enzyme
ff3d3313c4 fix BooleanQuery document (#1999)
* fix BooleanQuery document

* Update src/query/boolean_query/boolean_query.rs

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-04-20 11:37:20 +02:00
Paul Masurel
fbda511a1a Making more things public for quickwit. (#2005) 2023-04-20 11:37:45 +09:00
Adam Reichold
c1defdda05 Bump aho-corasick dependency to version 1.0 and adjust to API changes (#2002)
* Drop additional Arc-layer as the automaton itself is now cheap-to-clone.
* Drop state ID type parameter as it is not exposed by the library any more.
2023-04-18 07:34:30 +02:00
PSeitz
e522163a1c use json in agg tests (#1998)
* switch to JSON in tests, add flat aggregation types

* use method

* clippy

* remove commented file
2023-04-17 14:08:48 +02:00
PSeitz
e83abbfe4a perf: faster term hash map (#1940)
* add term hashmap benchmark

* refactor arena hashmap

add inlines
remove occupied array and use table_entry.is_empty instead (saves 4 bytes per entry)
reduce saturation threshold from 1/3 to 1/2 to reduce memory
use u32 for UnorderedId (we have the 4billion limit anyways on the Columnar stuff)
fix naming LinearProbing
remove byteorder dependency

memory consumption went down from 2Gb to 1.8GB on indexing wikipedia dataset in tantivy

* Update stacker/src/arena_hashmap.rs

Co-authored-by: Paul Masurel <paul@quickwit.io>

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-04-17 09:07:33 +02:00
trinity-1686a
780e26331d sstable compression (#1946)
* compress sstable with zstd

* add some details to sstable readme

* compress only block which benefit from it

* multiple changes to sstable

make compression optional
use OwnedBytes instead of impl Read in sstable, required for next point
use zstd bulk api, which is much faster on small records

* cleanup and use bulk api for compression

* use dedicated byte for compression

* switch block len and compression flag

* change default zstd level in sstable
2023-04-14 16:25:50 +02:00
trinity-1686a
0286ecea09 re-export a few sstable functions on dicitonary (#1996)
* re-export a few sstable functions on dicitonary

* Update documentation

Co-authored-by: François Massot <francois.massot@gmail.com>

---------

Co-authored-by: François Massot <francois.massot@gmail.com>
2023-04-14 11:13:48 +02:00
PSeitz
b0ef9a6252 use crates.io dependency (#1990) 2023-04-14 09:35:20 +08:00
François Massot
36138c493b Merge pull request #1994 from quickwit-oss/fmassot/expose-simple-token-stream
Expose `SimpleTokenStream` to use it in quickwit for the multilanguage tokenizer
2023-04-13 18:55:02 +02:00
François Massot
64bce340b2 Expose to use it in quickwit. 2023-04-13 18:28:53 +02:00
trinity-1686a
205e8a0a92 encode dictionary type in fst footer (#1968)
* encode additional footer for dictionary kind in fst
2023-04-12 09:43:01 +02:00
Paul Masurel
4b01cc4c49 Made BooleanWeight and BoostWeight public (#1991) 2023-04-12 10:26:30 +09:00
PSeitz
0ed13eeea8 add sparse to agg benchmark (#1986)
* add sparse to agg benchmark

* Update src/aggregation/agg_bench.rs

Co-authored-by: Paul Masurel <paul@quickwit.io>

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-04-11 08:13:32 +02:00
Tony-X
91a38058fe Fix typo in READEME.md (#1989) 2023-04-11 12:07:20 +09:00
PSeitz
41af70799d add percentiles aggregations (#1984)
* add percentiles aggregations

add percentiles aggregation
fix disabled agg benchmark

* Update src/aggregation/metric/percentiles.rs

Co-authored-by: Paul Masurel <paul@quickwit.io>

* Apply suggestions from code review

Co-authored-by: Paul Masurel <paul@quickwit.io>

* fix import

* fix import

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-04-07 07:18:28 +02:00
Paul Masurel
f853bf204b Align the numerical type priority order with columnar. (#1978)
Closes #1956
2023-04-07 10:07:54 +09:00
Tony-X
11ae48d3bc Update benchmarks section in READEME.md to link to the bench repo (#1985)
* Update benchmarks section in READEME.md to link to the bench repo

* Apply suggestions from code review

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-04-07 10:07:06 +09:00
Paul Masurel
5eb12173d6 Proptest merge columnar (#1976)
* Added proptest on columnar merge with a shuffle

Made column serialization more explicit.
Bugfix when a bytes column is missing, and with a shuffle.
Improved the cardinality detection logic / column detection.

* Code review

* CR comments

* Following CR
2023-04-04 11:28:42 +09:00
PSeitz
5c4ea6a708 tokenizer option on text fastfield (#1945)
* tokenizer option on text fastfield

allow to set tokenizer option on text fastfield (fixes #1901)
handle PreTokenized strings in fast field

* change visibility

* remove custom de/serialization
2023-03-31 10:03:38 +02:00
PSeitz
4cf93dab7d fix build (#1973) 2023-03-31 13:54:03 +09:00
PSeitz
5c380b76e7 Better mixed types support in aggs and fix serialization issue (#1971)
* Better mixed types support in aggs and fix serialization issue

- Improve support for mixed types in JSON field aggregations (pick the right field, #1913)
- Resolve the issue with JSON serialization for numeric keys (fixes #1967)
- Add JSON round-trip test for term buckets
- Remove `u64_lenient`, as this is a footgun without the type
- move aggregation benchmarks

* remove shadowing
2023-03-31 05:52:11 +02:00
PSeitz
571735c5f7 Fix index sort by on optional/multicolumn (#1972)
Fix index sort by on optional/multicolumn
add optional columns to proptest
extend proptests for sort
add columnar sort tests
2023-03-31 04:24:11 +02:00
zhouhui
8e92f960d3 Fix comment: change max_merge_size to max_docs_before_merge. (#1970) 2023-03-28 22:49:00 +09:00
Paul Masurel
057211c3d8 Fixing build on arm (#1966) 2023-03-27 22:42:57 +09:00
Paul Masurel
059fc767ea Added ::MIN ::MAX DateTime. (#1965) 2023-03-27 15:32:53 +09:00
Paul Masurel
694a056255 Faster range (#1954)
* Faster range queries

This PR does several changes
- ip compact space now uses u32
- the bitunpacker now gets a get_batch function
- we push down range filtering, removing GCD / shift in the bitpacking
  codec.
- we rely on AVX2 routine to do the filtering.

* Apply suggestions from code review

* Apply suggestions from code review

* CR comments
2023-03-27 14:56:32 +09:00
Paul Masurel
2955e34452 Added proptests for building/merging columnar. (#1963) 2023-03-27 14:56:02 +09:00
Paul Masurel
821208480b Adding Debug/Display impl. Refining the ColumnIndex::get_cardinality 2023-03-26 14:40:37 +09:00
Paul Masurel
a2e3c2ed5b Renaming Column::idx -> Column::index (#1961)
There was some variable name ghosting happening.
2023-03-26 13:58:50 +09:00
PSeitz
835f228bfa fix cardinality when merging empty columns (#1960)
fixes #1958
2023-03-25 15:58:15 +09:00
Paul Masurel
2b6a4da640 Exposing empty column builder. (#1959) 2023-03-24 16:34:41 +09:00
PSeitz
d6a95381ee add memory check for term agg (#1957) 2023-03-24 06:47:45 +01:00
PSeitz
da2804644f fetch blocks of vals in aggregation for all cardinality (#1950)
* fetch blocks of vals in aggregation for all cardinality

* move caching in common accessor
2023-03-23 08:41:11 +01:00
PSeitz
5504cfd012 remove IterColumn (#1955)
fixes #1658
2023-03-23 06:43:17 +01:00
trinity-1686a
482b4155e8 fix bug with new sstable index format (#1953) 2023-03-22 10:22:36 +01:00
Till Wegmüller
1a35f6573d Switch fs2 to fs4 as it is now unmaintained and does not support illumos (#1944)
Signed-off-by: Till Wegmueller <toasterson@gmail.com>
2023-03-22 13:48:49 +09:00
trinity-1686a
e5e50603a8 new sstable format (#1943)
* document a new sstable format

* add support for changing target block size

* use new format for sstable index

* handle sstable version errror

* use very small blocks for proptests

* add a footer structure
2023-03-21 15:03:52 +01:00
PSeitz
8f7f1d6be4 add Display for ByteCount (#1949)
* add Display for ByteCount

* export missing AggregationLimits
2023-03-21 08:02:35 +01:00
PSeitz
6a7a1106d6 work in batches of docs (#1937)
* work in batches of docs

* add fill_buffer test
2023-03-21 06:57:44 +01:00
PSeitz
9e2faecf5b add memory limit for aggregations (#1942)
* add memory limit for aggregations

introduce AggregationLimits to set memory consumption limit and bucket limits
memory limit is checked during aggregation, bucket limit is checked before returning the aggregation request.

* Apply suggestions from code review

Co-authored-by: Paul Masurel <paul@quickwit.io>

* add ByteCount with human readable format

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-03-16 06:21:07 +01:00
PSeitz
b6703f1b3c fix validation in date histogram (#1936)
fix validation in date histogram for parameters interval and date_interval
2023-03-15 06:10:43 +01:00
PSeitz
2fb3740cb0 handle missing column for aggs (#1920)
* handle missing column for aggs

add empty column fallback for missing column in aggs.
Fix sort for term agg on sub-agg with missing value (null is smallest)

* add error when field is not fast
2023-03-15 06:09:59 +01:00
PSeitz
8459efa32c split term collection count and sub_agg (#1921)
use unrolled ColumnValues::get_vals
2023-03-13 04:37:41 +01:00
PSeitz
61cfd8dc57 fix clippy (#1927) 2023-03-13 03:12:02 +01:00
trinity-1686a
064518156f refactor tokenization pipeline to use GATs (#1924)
* refactor tokenization pipeline to use GATs

* fix doctests

* fix clippy lints

* remove commented code
2023-03-09 09:39:37 +01:00
PSeitz
a42a96f470 fix panic in dict column merge (#1930)
* fix panic in dict column merge

* Bugfix and added unit test

---------

Co-authored-by: Paul Masurel <paul@quickwit.io>
2023-03-08 22:04:37 +09:00
trinity-1686a
fcf5a25d93 use DeltaReader directly to implement Dictionnary::ord_to_term (#1928) 2023-03-08 11:15:56 +09:00
dependabot[bot]
c0a5b28fd3 Update lru requirement from 0.9.0 to 0.10.0 (#1932)
Updates the requirements on [lru](https://github.com/jeromefroe/lru-rs) to permit the latest version.
- [Release notes](https://github.com/jeromefroe/lru-rs/releases)
- [Changelog](https://github.com/jeromefroe/lru-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/jeromefroe/lru-rs/compare/0.9.0...0.10.0)

---
updated-dependencies:
- dependency-name: lru
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-03-07 15:09:02 +09:00
trinity-1686a
a4f7ca8309 use DeltaReader directly to implement Dictionnary::term_ord (#1925)
* use DeltaReader directly to implement Dictionnary::term_ord

* add some additional test case for Dictionary::term_ord
2023-03-06 09:45:22 +01:00
Paul Masurel
364e321415 Clippy fix (#1926) 2023-03-06 10:37:17 +09:00