Mirror of https://github.com/quickwit-oss/tantivy.git, synced 2026-06-22 10:20:43 +00:00

Compare commits: mallets/so...bench_0.25 (2 commits)

| Author | SHA1 | Date | |
|---|---|---|---|
| | f1dd69c616 | | |
| | 747caeb568 | | |
@@ -1,125 +0,0 @@
---
name: rationalize-deps
description: Analyze Cargo.toml dependencies and attempt to remove unused features to reduce compile times and binary size
---

# Rationalize Dependencies

This skill analyzes Cargo.toml dependencies to identify and remove unused features.

## Overview

Many crates enable features by default that may not be needed. This skill:
1. Identifies dependencies with default features enabled
2. Tests if `default-features = false` works
3. Identifies which specific features are actually needed
4. Verifies compilation after changes

## Step 1: Identify the target

Ask the user which crate(s) to analyze:
- A specific crate name (e.g., "tokio", "serde")
- A specific workspace member (e.g., "quickwit-search")
- "all" to scan the entire workspace

## Step 2: Analyze current dependencies

For the workspace Cargo.toml (`quickwit/Cargo.toml`), list dependencies that:
- Do NOT have `default-features = false`
- Have default features that might be unnecessary

Run: `cargo tree -p <crate> -f "{p} {f}" --edges features` to see what features are actually used.

## Step 3: For each candidate dependency

### 3a: Check the crate's default features

Look up the crate on crates.io or check its Cargo.toml to understand:
- What features are enabled by default
- What each feature provides

Use: `cargo metadata --format-version=1 | jq '.packages[] | select(.name == "<crate>") | .features'`

### 3b: Try disabling default features

Modify the dependency in `quickwit/Cargo.toml`:

From:
```toml
some-crate = { version = "1.0" }
```

To:
```toml
some-crate = { version = "1.0", default-features = false }
```

### 3c: Run cargo check

Run: `cargo check --workspace` (or target specific packages for faster feedback)

If compilation fails:
1. Read the error messages to identify which features are needed
2. Add only the required features explicitly:
```toml
some-crate = { version = "1.0", default-features = false, features = ["needed-feature"] }
```
3. Re-run cargo check

### 3d: Binary search for minimal features

If there are many default features, use binary search:
1. Start with no features
2. If it fails, add half the default features
3. Continue until you find the minimal set
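
The search in 3d can be approximated with a simpler one-at-a-time elimination, sketched below in POSIX shell. It is not a true binary search: it drops each feature in turn and keeps the removal only if the build still passes, so it yields a small set rather than a guaranteed smallest one. `builds_with` is a hypothetical placeholder, not a Cargo command; it is assumed to write the given feature list into the dependency's `features = [...]` entry and then run `cargo check`.

```bash
# Hypothetical hook (an assumption, not part of Cargo): return 0 when the
# workspace compiles with exactly the features passed as arguments.
# This body only runs the check; applying "$@" to Cargo.toml is left to you.
builds_with() {
    cargo check --workspace --quiet
}

# Print the features that could not be removed without breaking the build.
# Usage: minimize_features std derive alloc
minimize_features() {
    kept="$*"
    for candidate in "$@"; do
        # Build the trial list: everything still kept, minus the candidate.
        trial=""
        for feature in $kept; do
            if [ "$feature" != "$candidate" ]; then
                trial="${trial:+$trial }$feature"
            fi
        done
        # Feature names contain no spaces, so word splitting is intentional.
        if builds_with $trial; then
            kept="$trial"
        fi
    done
    printf '%s\n' "$kept"
}
```

Each candidate costs one `cargo check`, so this is only practical when the default feature list is short.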

## Step 4: Document findings

For each dependency analyzed, report:
- Original configuration
- New configuration (if changed)
- Features that were removed
- Any features that are required

## Step 5: Verify full build

After all changes, run:
```bash
cargo check --workspace --all-targets
cargo test --workspace --no-run
```

## Common Patterns

### Serde
Often only needs `derive`:
```toml
serde = { version = "1.0", default-features = false, features = ["derive", "std"] }
```

### Tokio
Identify which runtime features are actually used:
```toml
tokio = { version = "1.0", default-features = false, features = ["rt-multi-thread", "macros", "sync"] }
```

### Reqwest
Often doesn't need all TLS backends:
```toml
reqwest = { version = "0.11", default-features = false, features = ["rustls-tls", "json"] }
```

## Rollback

If changes cause issues:
```bash
git checkout quickwit/Cargo.toml
cargo check --workspace
```

## Tips

- Start with large crates that have many default features (tokio, reqwest, hyper)
- Use `cargo bloat --crates` to identify large dependencies
- Check `cargo tree -d` for duplicate dependencies that might indicate feature conflicts
- Some features are needed only for tests - consider using `[dev-dependencies]` features
@@ -1,60 +0,0 @@
---
name: simple-pr
description: Create a simple PR from staged changes with an auto-generated commit message
disable-model-invocation: true
---

# Simple PR

Follow these steps to create a simple PR from staged changes:

## Step 1: Check workspace state

Run: `git status`

Verify that all changes have been staged (no unstaged changes). If there are unstaged changes, abort and ask the user to stage their changes first with `git add`.

Also verify that we are on the `main` branch. If not, abort and ask the user to switch to main first.

## Step 2: Ensure main is up to date

Run: `git pull origin main`

This ensures we're working from the latest code.

## Step 3: Review staged changes

Run: `git diff --cached`

Review the staged changes to understand what the PR will contain.

## Step 4: Generate commit message

Based on the staged changes, generate a concise commit message (1-2 sentences) that describes the "why" rather than the "what".

Display the proposed commit message to the user and ask for confirmation before proceeding.

## Step 5: Create a new branch

Get the git username: `git config user.name | tr ' ' '-' | tr '[:upper:]' '[:lower:]'`

Create a short, descriptive branch name based on the changes (e.g., `fix-typo-in-readme`, `add-retry-logic`, `update-deps`).

Create and checkout the branch: `git checkout -b {username}/{short-descriptive-name}`
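
As a sketch, the two parts of the branch name can be assembled in shell. `branch_name` is a hypothetical helper, not something the skill defines; it applies the same `tr` normalization as the command in this step to whatever user name it is given.

```bash
# Hypothetical helper: build "{username}/{short-descriptive-name}".
# $1 = raw user name (e.g. the output of `git config user.name`)
# $2 = short descriptive name chosen for the change
branch_name() {
    username=$(printf '%s' "$1" | tr ' ' '-' | tr '[:upper:]' '[:lower:]')
    printf '%s/%s\n' "$username" "$2"
}

# Typical call inside a repository (commented out so the snippet has no side effects):
# git checkout -b "$(branch_name "$(git config user.name)" fix-typo-in-readme)"
```

Keeping the normalization in one function means the same name can be reused later for `git push -u origin`.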

## Step 6: Commit changes

Commit with the message from Step 4:
```
git commit -m "{commit-message}"
```

## Step 7: Push and open a PR

Push the branch and open a PR:
```
git push -u origin {branch-name}
gh pr create --title "{commit-message-title}" --body "{longer-description-if-needed}"
```

Report the PR URL to the user when complete.
@@ -1,87 +0,0 @@
---
name: update-changelog
description: Update CHANGELOG.md with merged PRs since the last changelog update, categorized by type
---

# Update Changelog

This skill updates CHANGELOG.md with merged PRs that aren't already listed.

## Step 1: Determine the changelog scope

Read `CHANGELOG.md` to identify the current unreleased version section at the top (e.g., `Tantivy 0.26 (Unreleased)`).

Collect all PR numbers already mentioned in the unreleased section by extracting `#NNNN` references.
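
One way to collect those references is a small shell filter. `changelog_pr_numbers` is a hypothetical name; the sketch assumes the unreleased section is piped in on standard input and that `grep` supports `-o` (GNU and BSD grep do).

```bash
# Hypothetical helper: read changelog text on stdin and print each
# referenced PR number once, in ascending numeric order.
changelog_pr_numbers() {
    grep -oE '#[0-9]+' | tr -d '#' | sort -un
}
```

Feeding it only the unreleased section, rather than the whole file, keeps PR numbers from earlier releases out of the result.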

## Step 2: Find merged PRs not yet in the changelog

Use `gh` to list recently merged PRs from the upstream repo:

```bash
gh pr list --repo quickwit-oss/tantivy --state merged --limit 100 --json number,title,author,labels,mergedAt
```

Filter out any PRs whose number already appears in the unreleased section of the changelog.

## Step 3: Consolidate related PRs

Before categorizing, group PRs that belong to the same logical change. This is critical for producing a clean changelog. Use PR descriptions, titles, cross-references, and the files touched to identify relationships.

**Merge follow-up PRs into the original:**
- If a PR is a bugfix, refinement, or follow-up to another PR in the same unreleased cycle, combine them into a single changelog entry with multiple `[#N](url)` links.
- Also consolidate PRs that touch the same feature area even if not explicitly linked — e.g., a PR fixing an edge case in a new API should be folded into the entry for the PR that introduced that API.

**Filter out bugfixes on unreleased features:**
- If a bugfix PR fixes something introduced by another PR in the **same unreleased version**, it must NOT appear as a separate Bugfixes entry. Instead, silently fold it into the original feature/improvement entry. The changelog should describe the final shipped state, not the development history.
- To detect this: check if the bugfix PR references or reverts changes from another PR in the same release cycle, or if it touches code that was newly added (not present in the previous release).

## Step 4: Review the actual code diff

**Do not rely on PR titles or descriptions alone.** For every candidate PR, run `gh pr diff <number> --repo quickwit-oss/tantivy` and read the actual changes. PR titles are often misleading — the diff is the source of truth.

**What to look for in the diff:**
- Does it change observable behavior, public API surface, or performance characteristics?
- Is the change something a user of the library would notice or need to know about?
- Could the change break existing code (API changes, removed features)?

**Skip PRs where the diff reveals the change is not meaningful enough for the changelog** — e.g., cosmetic renames, trivial visibility tweaks, test-only changes, etc.

## Step 5: Categorize each PR group

For each PR (or consolidated group) that survived the diff review, determine its category:

- **Bugfixes** — fixes to behavior that existed in the **previous release**. NOT fixes to features introduced in this release cycle.
- **Features/Improvements** — new features, API additions, new options, improvements that change user-facing behavior or add new capabilities.
- **Performance** — optimizations, speed improvements, memory reductions. **If a PR adds new API whose primary purpose is enabling a performance optimization, categorize it as Performance, not Features.** The deciding question is: does a user benefit from this because of new functionality, or because things got faster/leaner? For example, a new trait method that exists solely to enable cheaper intersection ordering is Performance, not a Feature.

If a PR doesn't clearly fit any category (e.g., CI-only changes, internal refactors with no user-facing impact, dependency bumps with no behavior change), skip it — not everything belongs in the changelog.

When unclear, use your best judgment or ask the user.

## Step 6: Format entries

Each entry must follow this exact format:

```
- Description [#NUMBER](https://github.com/quickwit-oss/tantivy/pull/NUMBER)(@author)
```

Rules:
- The description should be concise and describe the user-facing change (not the implementation). Describe the final shipped state, not the incremental development steps.
- Use sub-categories with bold headers when multiple entries relate to the same area (e.g., `- **Aggregation**` with indented entries beneath). Follow the existing grouping style in the changelog.
- Author is the GitHub username from the PR, prefixed with `@`. For consolidated entries, include all contributing authors.
- For consolidated PRs, list all PR links in a single entry: `[#100](url) [#110](url)` (see existing entries for examples).
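
The entry format can be sketched as a shell function. `format_entry` is a hypothetical helper, not an existing script in the repository; it takes the description, the author list already prefixed with `@`, and one or more PR numbers.

```bash
# Hypothetical helper: print one changelog entry.
# $1 = description, $2 = authors (e.g. "@stuhood @PSeitz"), remaining args = PR numbers
format_entry() {
    description=$1
    authors=$2
    shift 2
    links=""
    for number in "$@"; do
        link="[#${number}](https://github.com/quickwit-oss/tantivy/pull/${number})"
        # Separate consecutive links with a single space.
        links="${links:+$links }$link"
    done
    printf '%s\n' "- ${description} ${links}(${authors})"
}
```

Passing several PR numbers produces the consolidated form, with all links in one entry followed by the combined author list.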

## Step 7: Present changes to the user

Show the user the proposed changelog entries grouped by category **before** editing the file. Ask for confirmation or adjustments.

## Step 8: Update CHANGELOG.md

Insert the new entries into the appropriate sections of the unreleased version block. If a section doesn't exist yet, create it following the order: Bugfixes, Features/Improvements, Performance.

Append new entries at the end of each section (before the next section header or version header).

## Step 9: Verify

Read back the updated unreleased section and display it to the user for final review.
.github/dependabot.yml (vendored, 4 changes)
@@ -6,8 +6,6 @@ updates:
interval: daily
time: "20:00"
open-pull-requests-limit: 10
cooldown:
default-days: 2

- package-ecosystem: "github-actions"
directory: "/"
@@ -15,5 +13,3 @@ updates:
interval: daily
time: "20:00"
open-pull-requests-limit: 10
cooldown:
default-days: 2
.github/workflows/coverage.yml (vendored, 19 changes)
@@ -4,9 +4,6 @@ on:
push:
branches: [main]

permissions:
contents: read

# Ensures that we cancel running jobs for the same PR / same workflow.
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
@@ -15,20 +12,16 @@ concurrency:
jobs:
coverage:
runs-on: ubuntu-latest

permissions:
contents: read

steps:
- uses: actions/checkout@df4cb1c069e1874edd31b4311f1884172cec0e10 # v6.0.3
- uses: actions/checkout@v4
- name: Install Rust
run: rustup toolchain install nightly-2025-12-01 --profile minimal --component llvm-tools-preview
- uses: Swatinem/rust-cache@c19371144df3bb44fab255c43d04cbc2ab54d1c4 # v2.9.1
- uses: taiki-e/install-action@e4b3a0453201addddc06d3a72db90326aad87084 # cargo-llvm-cov
run: rustup toolchain install nightly-2024-07-01 --profile minimal --component llvm-tools-preview
- uses: Swatinem/rust-cache@v2
- uses: taiki-e/install-action@cargo-llvm-cov
- name: Generate code coverage
run: cargo +nightly-2025-12-01 llvm-cov --all-features --workspace --doctests --lcov --output-path lcov.info
run: cargo +nightly-2024-07-01 llvm-cov --all-features --workspace --doctests --lcov --output-path lcov.info
- name: Upload coverage to Codecov
uses: codecov/codecov-action@fb8b3582c8e4def4969c97caa2f19720cb33a72f # v7.0.0
uses: codecov/codecov-action@v3
continue-on-error: true
with:
token: ${{ secrets.CODECOV_TOKEN }} # not required for public repos
.github/workflows/long_running.yml (vendored, 10 changes)
@@ -8,9 +8,6 @@ env:
CARGO_TERM_COLOR: always
NUM_FUNCTIONAL_TEST_ITERATIONS: 20000

permissions:
contents: read

# Ensures that we cancel running jobs for the same PR / same workflow.
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
@@ -21,13 +18,10 @@ jobs:

runs-on: ubuntu-latest

permissions:
contents: read

steps:
- uses: actions/checkout@df4cb1c069e1874edd31b4311f1884172cec0e10 # v6.0.3
- uses: actions/checkout@v4
- name: Install stable
uses: actions-rs/toolchain@16499b5e05bf2e26879000db0c1d13f7e13fa3af # v1.0.7
uses: actions-rs/toolchain@v1
with:
toolchain: stable
profile: minimal
.github/workflows/scorecard.yml (vendored, 49 changes)
@@ -1,49 +0,0 @@
name: OpenSSF Scorecard

on:
schedule:
- cron: '0 0 * * 0'
push:
branches:
- main

permissions:
contents: read

jobs:
analysis:
name: Scorecards analysis
runs-on: ubuntu-latest
permissions:
# Needed to upload the results to code-scanning dashboard.
security-events: write
# Needed to publish results
id-token: write

steps:
- name: 'Checkout code'
uses: actions/checkout@df4cb1c069e1874edd31b4311f1884172cec0e10 # v6.0.3
with:
persist-credentials: false

- name: 'Run analysis'
uses: ossf/scorecard-action@4eaacf0543bb3f2c246792bd56e8cdeffafb205a # v2.4.3
with:
results_file: results.sarif
results_format: sarif
repo_token: ${{ secrets.GITHUB_TOKEN }}
publish_results: true

# Upload the results as artifacts.
- name: 'Upload artifact'
uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
with:
name: SARIF file
path: results.sarif
retention-days: 5

# Upload the results to GitHub's code scanning dashboard.
- name: 'Upload to code-scanning'
uses: github/codeql-action/upload-sarif@87557b9c84dde89fdd9b10e88954ac2f4248e463 # v4.36.1
with:
sarif_file: results.sarif
.github/workflows/test.yml (vendored, 58 changes)
@@ -9,9 +9,6 @@ on:
env:
CARGO_TERM_COLOR: always

permissions:
contents: read

# Ensures that we cancel running jobs for the same PR / same workflow.
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
@@ -22,39 +19,35 @@ jobs:

runs-on: ubuntu-latest

permissions:
contents: read
checks: write

steps:
- uses: actions/checkout@df4cb1c069e1874edd31b4311f1884172cec0e10 # v6.0.3
- uses: actions/checkout@v4

- name: Install nightly
uses: actions-rs/toolchain@16499b5e05bf2e26879000db0c1d13f7e13fa3af # v1.0.7
uses: actions-rs/toolchain@v1
with:
toolchain: nightly
profile: minimal
components: rustfmt
- name: Install stable
uses: actions-rs/toolchain@16499b5e05bf2e26879000db0c1d13f7e13fa3af # v1.0.7
uses: actions-rs/toolchain@v1
with:
toolchain: stable
profile: minimal
components: clippy

- uses: Swatinem/rust-cache@c19371144df3bb44fab255c43d04cbc2ab54d1c4 # v2.9.1
- uses: Swatinem/rust-cache@v2

- name: Check Formatting
run: cargo +nightly fmt --all -- --check

- name: Check Stable Compilation
run: cargo build --all-features

- name: Check Bench Compilation
run: cargo +nightly bench --no-run --profile=dev --all-features

- uses: actions-rs/clippy-check@b5b5f21f4797c02da247df37026fcd0a5024aa4d # v1.0.7
- uses: actions-rs/clippy-check@v1
with:
toolchain: stable
token: ${{ secrets.GITHUB_TOKEN }}
@@ -64,47 +57,30 @@ jobs:

runs-on: ubuntu-latest

permissions:
contents: read

strategy:
matrix:
features:
- { label: "all", flags: "mmap,stopwords,lz4-compression,zstd-compression,failpoints,stemmer" }
- { label: "quickwit", flags: "mmap,quickwit,failpoints" }
- { label: "none", flags: "" }
features: [
{ label: "all", flags: "mmap,stopwords,lz4-compression,zstd-compression,failpoints" },
{ label: "quickwit", flags: "mmap,quickwit,failpoints" }
]

name: test-${{ matrix.features.label}}

steps:
- uses: actions/checkout@df4cb1c069e1874edd31b4311f1884172cec0e10 # v6.0.3
- uses: actions/checkout@v4

- name: Install stable
uses: actions-rs/toolchain@16499b5e05bf2e26879000db0c1d13f7e13fa3af # v1.0.7
uses: actions-rs/toolchain@v1
with:
toolchain: stable
profile: minimal
override: true

- uses: taiki-e/install-action@56cc9adf3a3e2c23eafb56e8acaf9d0373cb845a # nextest
- uses: Swatinem/rust-cache@c19371144df3bb44fab255c43d04cbc2ab54d1c4 # v2.9.1
- uses: taiki-e/install-action@nextest
- uses: Swatinem/rust-cache@v2

- name: Run tests
run: |
# if matrix.feature.flags is empty then run on --lib to avoid compiling examples
# (as most of them rely on mmap) otherwise run all
if [ -z "${{ matrix.features.flags }}" ]; then
cargo +stable nextest run --lib --no-default-features --verbose --workspace
else
cargo +stable nextest run --features ${{ matrix.features.flags }} --no-default-features --verbose --workspace
fi
run: cargo +stable nextest run --features ${{ matrix.features.flags }} --verbose --workspace

- name: Run doctests
run: |
# if matrix.feature.flags is empty then run on --lib to avoid compiling examples
# (as most of them rely on mmap) otherwise run all
if [ -z "${{ matrix.features.flags }}" ]; then
echo "no doctest for no feature flag"
else
cargo +stable test --doc --features ${{ matrix.features.flags }} --verbose --workspace
fi
run: cargo +stable test --doc --features ${{ matrix.features.flags }} --verbose --workspace
CHANGELOG.md (77 changes)
@@ -1,58 +1,3 @@
Tantivy 0.26.1
================================

## Performance
- Fix quadratic runtime in nested term and composite aggregations: memory accounting scanned all parent buckets on every collect instead of just the current parent (@PSeitz @fulmicoton)

Tantivy 0.26 (Unreleased)
================================

## Bugfixes
- Align float query coercion during search with the columnar coercion rules [#2692](https://github.com/quickwit-oss/tantivy/pull/2692)(@fulmicoton)
- Fix lenient elastic range queries with trailing closing parentheses [#2816](https://github.com/quickwit-oss/tantivy/pull/2816)(@evance-br)
- Fix intersection `seek()` advancing below current doc id [#2812](https://github.com/quickwit-oss/tantivy/pull/2812)(@fulmicoton)
- Fix phrase query prefixed with `*` [#2751](https://github.com/quickwit-oss/tantivy/pull/2751)(@Darkheir)
- Fix `vint` buffer overflow during index creation [#2778](https://github.com/quickwit-oss/tantivy/pull/2778)(@rebasedming)
- Fix integer overflow in `ExpUnrolledLinkedList` for large datasets [#2735](https://github.com/quickwit-oss/tantivy/pull/2735)(@mdashti)
- Fix integer overflow in segment sorting and merge policy truncation [#2846](https://github.com/quickwit-oss/tantivy/pull/2846)(@anaslimem)
- Fix merging of intermediate aggregation results [#2719](https://github.com/quickwit-oss/tantivy/pull/2719)(@PSeitz)
- Fix deduplicate doc counts in term aggregation for multi-valued fields [#2854](https://github.com/quickwit-oss/tantivy/pull/2854)(@nuri-yoo)

## Features/Improvements
- **Aggregation**
- Add filter aggregation [#2711](https://github.com/quickwit-oss/tantivy/pull/2711)(@mdashti)
- Add include/exclude filtering for term aggregations [#2717](https://github.com/quickwit-oss/tantivy/pull/2717)(@PSeitz)
- Add public accessors for intermediate aggregation results [#2829](https://github.com/quickwit-oss/tantivy/pull/2829)(@congx4)
- Replace HyperLogLog++ with Apache DataSketches HLL for cardinality aggregation [#2837](https://github.com/quickwit-oss/tantivy/pull/2837) [#2842](https://github.com/quickwit-oss/tantivy/pull/2842)(@congx4)
- Add composite aggregation [#2856](https://github.com/quickwit-oss/tantivy/pull/2856)(@fulmicoton)
- **Fast Fields**
- Add fast field fallback for `TermQuery` when the field is not indexed [#2693](https://github.com/quickwit-oss/tantivy/pull/2693)(@PSeitz-dd)
- Add fast field support for `Bytes` values [#2830](https://github.com/quickwit-oss/tantivy/pull/2830)(@mdashti)
- **Query Parser**
- Add support for regexes in the query grammar [#2677](https://github.com/quickwit-oss/tantivy/pull/2677) [#2818](https://github.com/quickwit-oss/tantivy/pull/2818)(@Darkheir)
- Deduplicate queries in query parser [#2698](https://github.com/quickwit-oss/tantivy/pull/2698)(@PSeitz-dd)
- Add erased `SortKeyComputer` for sorting on column types unknown until runtime [#2770](https://github.com/quickwit-oss/tantivy/pull/2770) [#2790](https://github.com/quickwit-oss/tantivy/pull/2790)(@stuhood @PSeitz)
- Add natural-order-with-none-highest support in `TopDocs::order_by` [#2780](https://github.com/quickwit-oss/tantivy/pull/2780)(@stuhood)
- Move stemming behing `stemmer` feature flag [#2791](https://github.com/quickwit-oss/tantivy/pull/2791)(@fulmicoton)
- Make `DeleteMeta`, `AddOperation`, `advance_deletes`, `with_max_doc`, `serializer` module, and `delete_queue` public [#2762](https://github.com/quickwit-oss/tantivy/pull/2762) [#2765](https://github.com/quickwit-oss/tantivy/pull/2765) [#2766](https://github.com/quickwit-oss/tantivy/pull/2766) [#2835](https://github.com/quickwit-oss/tantivy/pull/2835)(@philippemnoel @PSeitz)
- Make `Language` hashable [#2763](https://github.com/quickwit-oss/tantivy/pull/2763)(@philippemnoel)
- Improve `space_usage` reporting for JSON fields and columnar data [#2761](https://github.com/quickwit-oss/tantivy/pull/2761)(@PSeitz-dd)
- Split `Term` into `Term` and `IndexingTerm` [#2744](https://github.com/quickwit-oss/tantivy/pull/2744) [#2750](https://github.com/quickwit-oss/tantivy/pull/2750)(@PSeitz-dd @PSeitz)

## Performance
- **Aggregation**
- Large speed up and memory reduction for nested high cardinality aggregations by using one collector per request instead of one per bucket, and adding `PagedTermMap` for faster medium cardinality term aggregations [#2715](https://github.com/quickwit-oss/tantivy/pull/2715) [#2759](https://github.com/quickwit-oss/tantivy/pull/2759)(@PSeitz @PSeitz-dd)
- Optimize low-cardinality term aggregations by using a `Vec` instead of a `HashMap` [#2740](https://github.com/quickwit-oss/tantivy/pull/2740)(@fulmicoton-dd)
- Optimize `ExistsQuery` for a high number of dynamic columns [#2694](https://github.com/quickwit-oss/tantivy/pull/2694)(@PSeitz-dd)
- Add lazy scorers to stop score evaluation early when a doc won't reach the top-K threshold [#2726](https://github.com/quickwit-oss/tantivy/pull/2726) [#2777](https://github.com/quickwit-oss/tantivy/pull/2777)(@fulmicoton @stuhood)
- Add `DocSet::cost()` and use it to order scorers in intersections [#2707](https://github.com/quickwit-oss/tantivy/pull/2707)(@PSeitz)
- Add `collect_block` support for collector wrappers [#2727](https://github.com/quickwit-oss/tantivy/pull/2727)(@stuhood)
- Optimize saturated posting lists by replacing them with `AllScorer` in boolean queries [#2745](https://github.com/quickwit-oss/tantivy/pull/2745) [#2760](https://github.com/quickwit-oss/tantivy/pull/2760) [#2774](https://github.com/quickwit-oss/tantivy/pull/2774)(@fulmicoton @mdashti @trinity-1686a)
- Add `seek_danger` on `DocSet` for more efficient intersections [#2538](https://github.com/quickwit-oss/tantivy/pull/2538) [#2810](https://github.com/quickwit-oss/tantivy/pull/2810)(@PSeitz @stuhood @fulmicoton)
- Skip column traversal in `RangeDocSet` when query range does not overlap with column bounds [#2783](https://github.com/quickwit-oss/tantivy/pull/2783)(@ChangRui-Ryan)
- Speed up exclude queries by supporting multiple excluded `DocSet`s without intermediate union [#2825](https://github.com/quickwit-oss/tantivy/pull/2825)(@PSeitz)
- Improve union performance for non-score unions with `fill_buffer` and optimized `TinySet` [#2863](https://github.com/quickwit-oss/tantivy/pull/2863)(@PSeitz)

Tantivy 0.25
================================

@@ -69,18 +14,6 @@ Tantivy 0.25
- Support mixed field types in query parser [#2676](https://github.com/quickwit-oss/tantivy/pull/2676)(@trinity-1686a)
- Add per-field size details [#2679](https://github.com/quickwit-oss/tantivy/pull/2679)(@fulmicoton)

Tantivy 0.24.2
================================
- Fix TopNComputer for reverse order. [#2672](https://github.com/quickwit-oss/tantivy/pull/2672)(@stuhood @PSeitz)

Affected queries are [order_by_fast_field](https://docs.rs/tantivy/latest/tantivy/collector/struct.TopDocs.html#method.order_by_fast_field) and
[order_by_u64_field](https://docs.rs/tantivy/latest/tantivy/collector/struct.TopDocs.html#method.order_by_u64_field)
for `Order::Asc`

Tantivy 0.24.1
================================
- Fix: bump required rust version to 1.81

Tantivy 0.24
================================
Tantivy 0.24 will be backwards compatible with indices created with v0.22 and v0.21. The new minimum rust version will be 1.75. Tantivy 0.23 will be skipped.
@@ -133,7 +66,7 @@ This will slightly increase space and access time. [#2439](https://github.com/qu

- **Store DateTime as nanoseconds in doc store** DateTime in the doc store was truncated to microseconds previously. This removes this truncation, while still keeping backwards compatibility. [#2486](https://github.com/quickwit-oss/tantivy/pull/2486)(@PSeitz)

- **Performance/Memory**
- **Performace/Memory**
- lift clauses in LogicalAst for optimized ast during execution [#2449](https://github.com/quickwit-oss/tantivy/pull/2449)(@PSeitz)
- Use Vec instead of BTreeMap to back OwnedValue object [#2364](https://github.com/quickwit-oss/tantivy/pull/2364)(@fulmicoton)
- Replace TantivyDocument with CompactDoc. CompactDoc is much smaller and provides similar performance. [#2402](https://github.com/quickwit-oss/tantivy/pull/2402)(@PSeitz)
@@ -163,14 +96,6 @@ This will slightly increase space and access time. [#2439](https://github.com/qu
- Fix trait bound of StoreReader::iter [#2360](https://github.com/quickwit-oss/tantivy/pull/2360)(@adamreichold)
- remove read_postings_no_deletes [#2526](https://github.com/quickwit-oss/tantivy/pull/2526)(@PSeitz)

Tantivy 0.22.1
================================
- Fix TopNComputer for reverse order. [#2672](https://github.com/quickwit-oss/tantivy/pull/2672)(@stuhood @PSeitz)

Affected queries are [order_by_fast_field](https://docs.rs/tantivy/latest/tantivy/collector/struct.TopDocs.html#method.order_by_fast_field) and
[order_by_u64_field](https://docs.rs/tantivy/latest/tantivy/collector/struct.TopDocs.html#method.order_by_u64_field)
for `Order::Asc`

Tantivy 0.22
================================
Cargo.toml (92 changes)
@@ -1,6 +1,6 @@
[package]
name = "tantivy"
version = "0.26.0"
version = "0.25.0"
authors = ["Paul Masurel <paul.masurel@gmail.com>"]
license = "MIT"
categories = ["database-implementations", "data-structures"]
@@ -11,11 +11,11 @@ repository = "https://github.com/quickwit-oss/tantivy"
readme = "README.md"
keywords = ["search", "information", "retrieval"]
edition = "2021"
rust-version = "1.86"
rust-version = "1.85"
exclude = ["benches/*.json", "benches/*.txt"]

[dependencies]
oneshot = "0.1.13"
oneshot = "0.1.7"
base64 = "0.22.0"
byteorder = "1.4.3"
crc32fast = "1.3.2"
@@ -27,7 +27,7 @@ regex = { version = "1.5.5", default-features = false, features = [
aho-corasick = "1.0"
tantivy-fst = "0.5"
memmap2 = { version = "0.9.0", optional = true }
lz4_flex = { version = "0.13", default-features = false, optional = true }
lz4_flex = { version = "0.11", default-features = false, optional = true }
zstd = { version = "0.13", optional = true, default-features = false }
tempfile = { version = "3.12.0", optional = true }
log = "0.4.16"
@@ -37,9 +37,9 @@ fs4 = { version = "0.13.1", optional = true }
levenshtein_automata = "0.2.1"
uuid = { version = "1.0.0", features = ["v4", "serde"] }
crossbeam-channel = "0.5.4"
rust-stemmers = { version = "1.2.0", optional = true }
rust-stemmers = "1.2.0"
downcast-rs = "2.0.1"
bitpacking = { version = "0.9.3", default-features = false, features = [
bitpacking = { version = "0.9.2", default-features = false, features = [
"bitpacker4x",
] }
census = "0.4.2"
@@ -47,52 +47,51 @@ rustc-hash = "2.0.0"
thiserror = "2.0.1"
htmlescape = "0.3.1"
fail = { version = "0.5.0", optional = true }
time = { version = "0.3.47", features = ["serde-well-known"] }
time = { version = "0.3.35", features = ["serde-well-known"] }
smallvec = "1.8.0"
rayon = "1.5.2"
lru = "0.16.3"
lru = "0.12.0"
fastdivide = "0.4.0"
itertools = "0.14.0"
measure_time = "0.9.0"
arc-swap = "1.5.0"
bon = "3.3.1"

columnar = { version = "0.7", path = "./columnar", package = "tantivy-columnar" }
sstable = { version = "0.7", path = "./sstable", package = "tantivy-sstable", optional = true }
stacker = { version = "0.7", path = "./stacker", package = "tantivy-stacker" }
query-grammar = { version = "0.26.0", path = "./query-grammar", package = "tantivy-query-grammar" }
tantivy-bitpacker = { version = "0.10", path = "./bitpacker" }
common = { version = "0.11", path = "./common/", package = "tantivy-common" }
tokenizer-api = { version = "0.7", path = "./tokenizer-api", package = "tantivy-tokenizer-api" }
|
||||
sketches-ddsketch = { version = "0.4", features = ["use_serde"] }
|
||||
datasketches = { version = "0.3.0", features = ["hll"] }
|
||||
columnar = { version = "0.6", path = "./columnar", package = "tantivy-columnar" }
|
||||
sstable = { version = "0.6", path = "./sstable", package = "tantivy-sstable", optional = true }
|
||||
stacker = { version = "0.6", path = "./stacker", package = "tantivy-stacker" }
|
||||
query-grammar = { version = "0.25.0", path = "./query-grammar", package = "tantivy-query-grammar" }
|
||||
tantivy-bitpacker = { version = "0.9", path = "./bitpacker" }
|
||||
common = { version = "0.10", path = "./common/", package = "tantivy-common" }
|
||||
tokenizer-api = { version = "0.6", path = "./tokenizer-api", package = "tantivy-tokenizer-api" }
|
||||
sketches-ddsketch = { version = "0.3.0", features = ["use_serde"] }
|
||||
hyperloglogplus = { version = "0.4.1", features = ["const-loop"] }
|
||||
futures-util = { version = "0.3.28", optional = true }
|
||||
futures-channel = { version = "0.3.28", optional = true }
|
||||
fnv = "1.0.7"
|
||||
typetag = "0.2.21"
|
||||
|
||||
[target.'cfg(windows)'.dependencies]
|
||||
winapi = "0.3.9"
|
||||
|
||||
[dev-dependencies]
|
||||
binggan = "0.17.0"
|
||||
rand = "0.9"
|
||||
binggan = "0.15.3"
|
||||
rand = "0.8.5"
|
||||
maplit = "1.0.2"
|
||||
matches = "0.1.9"
|
||||
pretty_assertions = "1.2.1"
|
||||
proptest = "1.7.0"
|
||||
proptest = "1.0.0"
|
||||
test-log = "0.2.10"
|
||||
futures = "0.3.21"
|
||||
paste = "1.0.11"
|
||||
more-asserts = "0.3.1"
|
||||
rand_distr = "0.5"
|
||||
time = { version = "0.3.47", features = ["serde-well-known", "macros"] }
|
||||
rand_distr = "0.4.3"
|
||||
time = { version = "0.3.10", features = ["serde-well-known", "macros"] }
|
||||
postcard = { version = "1.0.4", features = [
|
||||
"use-std",
|
||||
"use-std",
|
||||
], default-features = false }
|
||||
|
||||
[target.'cfg(not(windows))'.dev-dependencies]
|
||||
criterion = { version = "0.8", default-features = false }
|
||||
criterion = { version = "0.5", default-features = false }
|
||||
|
||||
[dev-dependencies.fail]
|
||||
version = "0.5.0"
|
||||
@@ -113,8 +112,7 @@ debug-assertions = true
|
||||
overflow-checks = true
|
||||
|
||||
[features]
|
||||
default = ["mmap", "stopwords", "lz4-compression", "columnar-zstd-compression", "stemmer"]
|
||||
stemmer = ["rust-stemmers"]
|
||||
default = ["mmap", "stopwords", "lz4-compression", "columnar-zstd-compression"]
|
||||
mmap = ["fs4", "tempfile", "memmap2"]
|
||||
stopwords = []
|
||||
|
||||
@@ -169,43 +167,3 @@ harness = false
|
||||
[[bench]]
|
||||
name = "agg_bench"
|
||||
harness = false
|
||||
|
||||
[[bench]]
|
||||
name = "exists_json"
|
||||
harness = false
|
||||
|
||||
[[bench]]
|
||||
name = "range_query"
|
||||
harness = false
|
||||
|
||||
[[bench]]
|
||||
name = "and_or_queries"
|
||||
harness = false
|
||||
|
||||
[[bench]]
|
||||
name = "range_queries"
|
||||
harness = false
|
||||
|
||||
[[bench]]
|
||||
name = "bool_queries_with_range"
|
||||
harness = false
|
||||
|
||||
[[bench]]
|
||||
name = "str_search_and_get"
|
||||
harness = false
|
||||
|
||||
[[bench]]
|
||||
name = "merge_segments"
|
||||
harness = false
|
||||
|
||||
[[bench]]
|
||||
name = "regex_all_terms"
|
||||
harness = false
|
||||
|
||||
[[bench]]
|
||||
name = "query_parser_nested"
|
||||
harness = false
|
||||
|
||||
[[bench]]
|
||||
name = "intersection_bench"
|
||||
harness = false
|
||||
|
||||
@@ -1,7 +1,6 @@
|
||||
[](https://docs.rs/crate/tantivy/)
|
||||
[](https://github.com/quickwit-oss/tantivy/actions/workflows/test.yml)
|
||||
[](https://codecov.io/gh/quickwit-oss/tantivy)
|
||||
[](https://scorecard.dev/viewer/?uri=github.com/quickwit-oss/tantivy)
|
||||
[](https://discord.gg/MT27AG5EVE)
|
||||
[](https://opensource.org/licenses/MIT)
|
||||
[](https://crates.io/crates/tantivy)
|
||||
@@ -24,6 +23,8 @@ performance for different types of queries/collections.
|
||||
|
||||
Your mileage WILL vary depending on the nature of queries and their load.
|
||||
|
||||
<img src="doc/assets/images/searchbenchmark.png">
|
||||
|
||||
Details about the benchmark can be found at this [repository](https://github.com/quickwit-oss/search-benchmark-game).
|
||||
|
||||
## Features
|
||||
@@ -124,7 +125,6 @@ You can also find other bindings on [GitHub](https://github.com/search?q=tantivy
|
||||
- [seshat](https://github.com/matrix-org/seshat/): A matrix message database/indexer
|
||||
- [tantiny](https://github.com/baygeldin/tantiny): Tiny full-text search for Ruby
|
||||
- [lnx](https://github.com/lnx-search/lnx): adaptable, typo tolerant search engine with a REST API
|
||||
- [Bichon](https://github.com/rustmailer/bichon): A lightweight, high-performance Rust email archiver with WebUI
|
||||
- and [more](https://github.com/search?q=tantivy)!
|
||||
|
||||
### On average, how much faster is Tantivy compared to Lucene?
|
||||
|
||||
2
TODO.txt
@@ -10,7 +10,7 @@ rename FastFieldReaders::open to load
|
||||
remove fast field reader
|
||||
|
||||
find a way to unify the two DateTime.
|
||||
re-add type check in the filter wrapper
|
||||
readd type check in the filter wrapper
|
||||
|
||||
add unit test on columnar list columns.
|
||||
|
||||
|
||||
@@ -1,8 +1,8 @@
|
||||
use binggan::plugins::PeakMemAllocPlugin;
|
||||
use binggan::{black_box, InputGroup, PeakMemAlloc, INSTRUMENTED_SYSTEM};
|
||||
use rand::distr::weighted::WeightedIndex;
|
||||
use rand::distributions::WeightedIndex;
|
||||
use rand::prelude::SliceRandom;
|
||||
use rand::rngs::StdRng;
|
||||
use rand::seq::IndexedRandom;
|
||||
use rand::{Rng, SeedableRng};
|
||||
use rand_distr::Distribution;
|
||||
use serde_json::json;
|
||||
@@ -10,7 +10,7 @@ use tantivy::aggregation::agg_req::Aggregations;
|
||||
use tantivy::aggregation::AggregationCollector;
|
||||
use tantivy::query::{AllQuery, TermQuery};
|
||||
use tantivy::schema::{IndexRecordOption, Schema, TextFieldIndexing, FAST, STRING};
|
||||
use tantivy::{doc, DateTime, Index, Term};
|
||||
use tantivy::{doc, Index, Term};
|
||||
|
||||
#[global_allocator]
|
||||
pub static GLOBAL: &PeakMemAlloc<std::alloc::System> = &INSTRUMENTED_SYSTEM;
|
||||
@@ -63,31 +63,18 @@ fn bench_agg(mut group: InputGroup<Index>) {
|
||||
register!(group, terms_all_unique_with_avg_sub_agg);
|
||||
register!(group, terms_many_with_avg_sub_agg);
|
||||
register!(group, terms_status_with_avg_sub_agg);
|
||||
register!(group, terms_status_with_terms_zipf_1000_sub_agg);
|
||||
register!(group, terms_zipf_1000_with_terms_status_sub_agg);
|
||||
register!(group, terms_status_with_histogram);
|
||||
register!(group, terms_status_with_date_histogram);
|
||||
register!(group, terms_status_with_date_histogram_hard_bounds);
|
||||
register!(group, terms_status_with_date_histogram_and_sibling_terms);
|
||||
register!(group, terms_zipf_1000);
|
||||
register!(group, terms_zipf_1000_with_histogram);
|
||||
register!(group, terms_zipf_1000_with_avg_sub_agg);
|
||||
|
||||
register!(group, terms_many_json_mixed_type_with_avg_sub_agg);
|
||||
|
||||
register!(group, composite_term_many_page_1000);
|
||||
register!(group, composite_term_many_page_1000_with_avg_sub_agg);
|
||||
register!(group, composite_term_few);
|
||||
register!(group, composite_histogram);
|
||||
register!(group, composite_histogram_calendar);
|
||||
// composite aggregations not available in 0.25
|
||||
// filter aggregations not available in 0.25
|
||||
|
||||
register!(group, cardinality_agg);
|
||||
register!(group, cardinality_agg_high_card);
|
||||
register!(group, cardinality_agg_low_card);
|
||||
register!(group, terms_status_with_cardinality_agg);
|
||||
register!(group, terms_100_buckets_with_cardinality_agg);
|
||||
register!(group, terms_many_with_single_term_order_by_card);
|
||||
register!(group, terms_many_with_single_term_2_order_by_card);
|
||||
|
||||
register!(group, range_agg);
|
||||
register!(group, range_agg_with_avg_sub_agg);
|
||||
@@ -99,12 +86,6 @@ fn bench_agg(mut group: InputGroup<Index>) {
|
||||
register!(group, histogram_with_term_agg_status);
|
||||
register!(group, avg_and_range_with_avg_sub_agg);
|
||||
|
||||
// Filter aggregation benchmarks
|
||||
register!(group, filter_agg_all_query_count_agg);
|
||||
register!(group, filter_agg_term_query_count_agg);
|
||||
register!(group, filter_agg_all_query_with_sub_aggs);
|
||||
register!(group, filter_agg_term_query_with_sub_aggs);
|
||||
|
||||
group.run();
|
||||
}
|
||||
|
||||
@@ -175,52 +156,10 @@ fn cardinality_agg(index: &Index) {
|
||||
});
|
||||
execute_agg(index, agg_req);
|
||||
}
|
||||
// Full-scan cardinality on a near-1M-cardinality string field.
|
||||
// Hits the dense (PagedBitset) path: every doc has a unique term,
|
||||
// so the bucket promotes from FxHashSet shortly into the scan.
|
||||
fn cardinality_agg_high_card(index: &Index) {
|
||||
let agg_req = json!({
|
||||
"cardinality": {
|
||||
"cardinality": {
|
||||
"field": "text_all_unique_terms"
|
||||
},
|
||||
}
|
||||
});
|
||||
execute_agg(index, agg_req);
|
||||
}
|
||||
// Full-scan cardinality on a tiny-cardinality string field (7 distinct
|
||||
// values). Stays on the FxHashSet path — the promotion threshold is
|
||||
// never crossed. Validates no regression on the sparse path.
|
||||
fn cardinality_agg_low_card(index: &Index) {
|
||||
let agg_req = json!({
|
||||
"cardinality": {
|
||||
"cardinality": {
|
||||
"field": "text_few_terms_status"
|
||||
},
|
||||
}
|
||||
});
|
||||
execute_agg(index, agg_req);
|
||||
}
|
||||
fn terms_status_with_cardinality_agg(index: &Index) {
|
||||
let agg_req = json!({
|
||||
"my_texts": {
|
||||
"terms": { "field": "text_few_terms_status" },
|
||||
"aggs": {
|
||||
"cardinality": {
|
||||
"cardinality": {
|
||||
"field": "text_few_terms_status"
|
||||
},
|
||||
}
|
||||
}
|
||||
},
|
||||
});
|
||||
execute_agg(index, agg_req);
|
||||
}
|
||||
|
||||
fn terms_100_buckets_with_cardinality_agg(index: &Index) {
|
||||
let agg_req = json!({
|
||||
"my_texts": {
|
||||
"terms": { "field": "text_1000_terms_zipf", "size": 100 },
|
||||
"aggs": {
|
||||
"cardinality": {
|
||||
"cardinality": {
|
||||
@@ -233,58 +172,6 @@ fn terms_100_buckets_with_cardinality_agg(index: &Index) {
|
||||
execute_agg(index, agg_req);
|
||||
}
|
||||
|
||||
fn terms_many_with_single_term_order_by_card(index: &Index) {
|
||||
let agg_req = json!({
|
||||
"my_texts": {
|
||||
"terms": { "field": "text_many_terms" },
|
||||
"aggs": {
|
||||
"nested_terms": {
|
||||
"terms": {
|
||||
"field": "single_term",
|
||||
"order": { "cardinality": "desc" }
|
||||
},
|
||||
"aggs": {
|
||||
"cardinality": {
|
||||
"cardinality": { "field": "text_few_terms" }
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
});
|
||||
execute_agg(index, agg_req);
|
||||
}
|
||||
|
||||
// Two-level terms ordered by cardinality at each level: a high-card outer terms
|
||||
// (text_many_terms) ordered by a cardinality sub-agg, with a nested low-card terms
|
||||
// (text_few_terms_status) also ordered by a cardinality sub-agg, plus an avg.
|
||||
fn terms_many_with_single_term_2_order_by_card(index: &Index) {
|
||||
let agg_req = json!({
|
||||
"by_ip": {
|
||||
"terms": {
|
||||
"field": "text_many_terms",
|
||||
"order": { "card_few_terms": "desc" }
|
||||
},
|
||||
"aggs": {
|
||||
"card_few_terms": {
|
||||
"cardinality": { "field": "text_few_terms" }
|
||||
},
|
||||
"nested_terms": {
|
||||
"terms": {
|
||||
"field": " single_term",
|
||||
"order": { "distinct_path2": "desc" }
|
||||
},
|
||||
"aggs": {
|
||||
"avg_botscore": { "avg": { "field": "score" } },
|
||||
"distinct_path2": { "cardinality": { "field": "text_few_terms" } }
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
});
|
||||
execute_agg(index, agg_req);
|
||||
}
|
||||
|
||||
fn terms_7(index: &Index) {
|
||||
let agg_req = json!({
|
||||
"my_texts": { "terms": { "field": "text_few_terms_status" } },
|
||||
@@ -357,30 +244,6 @@ fn terms_all_unique_with_avg_sub_agg(index: &Index) {
|
||||
});
|
||||
execute_agg(index, agg_req);
|
||||
}
|
||||
fn terms_status_with_terms_zipf_1000_sub_agg(index: &Index) {
|
||||
let agg_req = json!({
|
||||
"my_texts": {
|
||||
"terms": { "field": "text_few_terms_status" },
|
||||
"aggs": {
|
||||
"nested_terms": { "terms": { "field": "text_1000_terms_zipf" } }
|
||||
}
|
||||
}
|
||||
});
|
||||
execute_agg(index, agg_req);
|
||||
}
|
||||
|
||||
fn terms_zipf_1000_with_terms_status_sub_agg(index: &Index) {
|
||||
let agg_req = json!({
|
||||
"my_texts": {
|
||||
"terms": { "field": "text_1000_terms_zipf" },
|
||||
"aggs": {
|
||||
"nested_terms": { "terms": { "field": "text_few_terms_status" } }
|
||||
}
|
||||
}
|
||||
});
|
||||
execute_agg(index, agg_req);
|
||||
}
|
||||
|
||||
fn terms_status_with_histogram(index: &Index) {
|
||||
let agg_req = json!({
|
||||
"my_texts": {
|
||||
@@ -393,57 +256,6 @@ fn terms_status_with_histogram(index: &Index) {
|
||||
execute_agg(index, agg_req);
|
||||
}
|
||||
|
||||
fn terms_status_with_date_histogram(index: &Index) {
|
||||
let agg_req = json!({
|
||||
"my_texts": {
|
||||
"terms": { "field": "text_few_terms_status" },
|
||||
"aggs": {
|
||||
"over_time": { "date_histogram": { "field": "timestamp", "fixed_interval": "1h" } }
|
||||
}
|
||||
}
|
||||
});
|
||||
execute_agg(index, agg_req);
|
||||
}
|
||||
|
||||
/// Same fused terms × date_histogram, but with `hard_bounds`. The timestamps span 0..120h; the
|
||||
/// bounds drop only the first and last hour (ms: 1h=3_600_000, 119h=428_400_000), so almost every
|
||||
/// doc is in-bounds. This exercises the collector's hard-bounds path: `bounds.contains` runs per
|
||||
/// doc (the `all_docs_in_bounds` short-circuit is off) and the rare out-of-bounds doc takes the
|
||||
/// `term_counts` branch.
|
||||
fn terms_status_with_date_histogram_hard_bounds(index: &Index) {
|
||||
let agg_req = json!({
|
||||
"my_texts": {
|
||||
"terms": { "field": "text_few_terms_status" },
|
||||
"aggs": {
|
||||
"over_time": {
|
||||
"date_histogram": {
|
||||
"field": "timestamp",
|
||||
"fixed_interval": "1h",
|
||||
"hard_bounds": { "min": 3_600_000, "max": 428_400_000 }
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
});
|
||||
execute_agg(index, agg_req);
|
||||
}
|
||||
|
||||
/// Same fused terms × date_histogram, but with a sibling terms aggregation next to it. The fused
|
||||
/// fast path should still trigger for `my_texts` (sibling aggregations are independent top-level
|
||||
/// aggregations, so they don't change its eligibility).
|
||||
fn terms_status_with_date_histogram_and_sibling_terms(index: &Index) {
|
||||
let agg_req = json!({
|
||||
"my_texts": {
|
||||
"terms": { "field": "text_few_terms_status" },
|
||||
"aggs": {
|
||||
"over_time": { "date_histogram": { "field": "timestamp", "fixed_interval": "1h" } }
|
||||
}
|
||||
},
|
||||
"other_texts": { "terms": { "field": "text_few_terms" } }
|
||||
});
|
||||
execute_agg(index, agg_req);
|
||||
}
|
||||
|
||||
fn terms_zipf_1000_with_histogram(index: &Index) {
|
||||
let agg_req = json!({
|
||||
"my_texts": {
|
||||
@@ -499,75 +311,6 @@ fn terms_many_json_mixed_type_with_avg_sub_agg(index: &Index) {
|
||||
execute_agg(index, agg_req);
|
||||
}
|
||||
|
||||
fn composite_term_few(index: &Index) {
|
||||
let agg_req = json!({
|
||||
"my_ctf": {
|
||||
"composite": {
|
||||
"sources": [
|
||||
{ "text_few_terms": { "terms": { "field": "text_few_terms" } } }
|
||||
],
|
||||
"size": 1000
|
||||
}
|
||||
},
|
||||
});
|
||||
execute_agg(index, agg_req);
|
||||
}
|
||||
fn composite_term_many_page_1000(index: &Index) {
|
||||
let agg_req = json!({
|
||||
"my_ctmp1000": {
|
||||
"composite": {
|
||||
"sources": [
|
||||
{ "text_many_terms": { "terms": { "field": "text_many_terms" } } }
|
||||
],
|
||||
"size": 1000
|
||||
}
|
||||
},
|
||||
});
|
||||
execute_agg(index, agg_req);
|
||||
}
|
||||
fn composite_term_many_page_1000_with_avg_sub_agg(index: &Index) {
|
||||
let agg_req = json!({
|
||||
"my_ctmp1000wasa": {
|
||||
"composite": {
|
||||
"sources": [
|
||||
{ "text_many_terms": { "terms": { "field": "text_many_terms" } } }
|
||||
],
|
||||
"size": 1000,
|
||||
},
|
||||
"aggs": {
|
||||
"average_f64": { "avg": { "field": "score_f64" } }
|
||||
}
|
||||
},
|
||||
});
|
||||
execute_agg(index, agg_req);
|
||||
}
|
||||
fn composite_histogram(index: &Index) {
|
||||
let agg_req = json!({
|
||||
"my_ch": {
|
||||
"composite": {
|
||||
"sources": [
|
||||
{ "f64_histogram": { "histogram": { "field": "score_f64", "interval": 1 } } }
|
||||
],
|
||||
"size": 1000
|
||||
}
|
||||
},
|
||||
});
|
||||
execute_agg(index, agg_req);
|
||||
}
|
||||
fn composite_histogram_calendar(index: &Index) {
|
||||
let agg_req = json!({
|
||||
"my_chc": {
|
||||
"composite": {
|
||||
"sources": [
|
||||
{ "time_histogram": { "date_histogram": { "field": "timestamp", "calendar_interval": "month" } } }
|
||||
],
|
||||
"size": 1000
|
||||
}
|
||||
},
|
||||
});
|
||||
execute_agg(index, agg_req);
|
||||
}
|
||||
|
||||
fn execute_agg(index: &Index, agg_req: serde_json::Value) {
|
||||
let agg_req: Aggregations = serde_json::from_value(agg_req).unwrap();
|
||||
let collector = get_collector(agg_req);
|
||||
@@ -737,7 +480,7 @@ fn get_test_index_bench(cardinality: Cardinality) -> tantivy::Result<Index> {
|
||||
if reuse_index && std::path::Path::new("agg_bench").exists() {
|
||||
return Index::open_in_dir("agg_bench");
|
||||
}
|
||||
// crreate dir
|
||||
// create dir
|
||||
std::fs::create_dir_all("agg_bench")?;
|
||||
let mut schema_builder = Schema::builder();
|
||||
let text_fieldtype = tantivy::schema::TextOptions::default()
|
||||
@@ -745,8 +488,7 @@ fn get_test_index_bench(cardinality: Cardinality) -> tantivy::Result<Index> {
|
||||
TextFieldIndexing::default().set_index_option(IndexRecordOption::WithFreqs),
|
||||
)
|
||||
.set_stored();
|
||||
let text_field = schema_builder.add_text_field("text", text_fieldtype.clone());
|
||||
let single_term = schema_builder.add_text_field("single_term", FAST);
|
||||
let text_field = schema_builder.add_text_field("text", text_fieldtype);
|
||||
let json_field = schema_builder.add_json_field("json", FAST);
|
||||
let text_field_all_unique_terms =
|
||||
schema_builder.add_text_field("text_all_unique_terms", STRING | FAST);
|
||||
@@ -760,7 +502,6 @@ fn get_test_index_bench(cardinality: Cardinality) -> tantivy::Result<Index> {
|
||||
let score_field = schema_builder.add_u64_field("score", score_fieldtype.clone());
|
||||
let score_field_f64 = schema_builder.add_f64_field("score_f64", score_fieldtype.clone());
|
||||
let score_field_i64 = schema_builder.add_i64_field("score_i64", score_fieldtype);
|
||||
let date_field = schema_builder.add_date_field("timestamp", FAST);
|
||||
// use tmp dir
|
||||
let index = if reuse_index {
|
||||
Index::create_in_dir("agg_bench", schema_builder.build())?
|
||||
@@ -769,7 +510,7 @@ fn get_test_index_bench(cardinality: Cardinality) -> tantivy::Result<Index> {
|
||||
};
|
||||
// Approximate log proportions
|
||||
let status_field_data = [
|
||||
("INFO", 8000),
|
||||
("INFO", 8000u32),
|
||||
("ERROR", 300),
|
||||
("WARN", 1200),
|
||||
("DEBUG", 500),
|
||||
@@ -790,7 +531,7 @@ fn get_test_index_bench(cardinality: Cardinality) -> tantivy::Result<Index> {
|
||||
// Prepare 1000 unique terms sampled using a Zipf distribution.
|
||||
// Exponent ~1.1 approximates top-20 terms covering around ~20%.
|
||||
let terms_1000: Vec<String> = (1..=1000).map(|i| format!("term_{i}")).collect();
|
||||
let zipf_1000 = rand_distr::Zipf::new(1000.0, 1.1f64).unwrap();
|
||||
let zipf_1000 = rand_distr::Zipf::new(1000, 1.1f64).unwrap();
|
||||
|
||||
{
|
||||
let mut rng = StdRng::from_seed([1u8; 32]);
|
||||
@@ -810,8 +551,6 @@ fn get_test_index_bench(cardinality: Cardinality) -> tantivy::Result<Index> {
|
||||
index_writer.add_document(doc!(
|
||||
json_field => json!({"mixed_type": 10.0}),
|
||||
json_field => json!({"mixed_type": 10.0}),
|
||||
single_term => "single_term",
|
||||
single_term => "single_term",
|
||||
text_field => "cool",
|
||||
text_field => "cool",
|
||||
text_field_all_unique_terms => "cool",
|
||||
@@ -837,24 +576,18 @@ fn get_test_index_bench(cardinality: Cardinality) -> tantivy::Result<Index> {
|
||||
doc_with_value /= 20;
|
||||
}
|
||||
let _val_max = 1_000_000.0;
|
||||
const SPAN_MS: i64 = 120 * 3600 * 1000; // 120 hours in ms
|
||||
const NOISE_MS: i64 = 2 * 3600 * 1000; // ±2h noise
|
||||
for i in 0..doc_with_value {
|
||||
let val: f64 = rng.random_range(0.0..1_000_000.0);
|
||||
let json = if rng.random_bool(0.1) {
|
||||
for _ in 0..doc_with_value {
|
||||
let val: f64 = rng.gen_range(0.0..1_000_000.0);
|
||||
let json = if rng.gen_bool(0.1) {
|
||||
// 10% are numeric values
|
||||
json!({ "mixed_type": val })
|
||||
} else {
|
||||
json!({"mixed_type": many_terms_data.choose(&mut rng).unwrap().to_string()})
|
||||
};
|
||||
let base_ms = (i as i64 * SPAN_MS) / doc_with_value as i64;
|
||||
let noise_ms = rng.random_range(-NOISE_MS..NOISE_MS);
|
||||
let ts_ms = (base_ms + noise_ms).clamp(0, SPAN_MS);
|
||||
index_writer.add_document(doc!(
|
||||
single_term => "single_term",
|
||||
text_field => "cool",
|
||||
json_field => json,
|
||||
text_field_all_unique_terms => format!("unique_term_{}", rng.random::<u64>()),
|
||||
text_field_all_unique_terms => format!("unique_term_{}", rng.gen::<u64>()),
|
||||
text_field_many_terms => many_terms_data.choose(&mut rng).unwrap().to_string(),
|
||||
text_field_few_terms => few_terms_data.choose(&mut rng).unwrap().to_string(),
|
||||
text_field_few_terms_status => status_field_data[log_level_distribution.sample(&mut rng)].0,
|
||||
@@ -862,7 +595,6 @@ fn get_test_index_bench(cardinality: Cardinality) -> tantivy::Result<Index> {
|
||||
score_field => val as u64,
|
||||
score_field_f64 => lg_norm.sample(&mut rng),
|
||||
score_field_i64 => val as i64,
|
||||
date_field => DateTime::from_timestamp_millis(ts_ms),
|
||||
))?;
|
||||
if cardinality == Cardinality::OptionalSparse {
|
||||
for _ in 0..20 {
|
||||
@@ -876,61 +608,3 @@ fn get_test_index_bench(cardinality: Cardinality) -> tantivy::Result<Index> {
|
||||
|
||||
Ok(index)
|
||||
}
|
||||
|
||||
// Filter aggregation benchmarks
|
||||
|
||||
fn filter_agg_all_query_count_agg(index: &Index) {
|
||||
let agg_req = json!({
|
||||
"filtered": {
|
||||
"filter": "*",
|
||||
"aggs": {
|
||||
"count": { "value_count": { "field": "score" } }
|
||||
}
|
||||
}
|
||||
});
|
||||
execute_agg(index, agg_req);
|
||||
}
|
||||
|
||||
fn filter_agg_term_query_count_agg(index: &Index) {
|
||||
let agg_req = json!({
|
||||
"filtered": {
|
||||
"filter": "text:cool",
|
||||
"aggs": {
|
||||
"count": { "value_count": { "field": "score" } }
|
||||
}
|
||||
}
|
||||
});
|
||||
execute_agg(index, agg_req);
|
||||
}
|
||||
|
||||
fn filter_agg_all_query_with_sub_aggs(index: &Index) {
|
||||
let agg_req = json!({
|
||||
"filtered": {
|
||||
"filter": "*",
|
||||
"aggs": {
|
||||
"avg_score": { "avg": { "field": "score" } },
|
||||
"stats_score": { "stats": { "field": "score_f64" } },
|
||||
"terms_text": {
|
||||
"terms": { "field": "text_few_terms_status" }
|
||||
}
|
||||
}
|
||||
}
|
||||
});
|
||||
execute_agg(index, agg_req);
|
||||
}
|
||||
|
||||
fn filter_agg_term_query_with_sub_aggs(index: &Index) {
|
||||
let agg_req = json!({
|
||||
"filtered": {
|
||||
"filter": "text:cool",
|
||||
"aggs": {
|
||||
"avg_score": { "avg": { "field": "score" } },
|
||||
"stats_score": { "stats": { "field": "score_f64" } },
|
||||
"terms_text": {
|
||||
"terms": { "field": "text_few_terms_status" }
|
||||
}
|
||||
}
|
||||
}
|
||||
});
|
||||
execute_agg(index, agg_req);
|
||||
}
|
||||
|
||||
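The doc comment on `terms_status_with_date_histogram_hard_bounds` in the benchmark above quotes millisecond bounds for a 0..120h timestamp span. The arithmetic can be checked with a minimal sketch. The helper names `in_bounds` and `count_in_bounds` are hypothetical and not part of tantivy. The inclusive bounds check is an assumption about `hard_bounds` semantics, and the noise term used by the real index builder is left out.

```rust
// Illustrative constants; they mirror the numbers quoted in the benchmark comment.
const HOUR_MS: i64 = 3_600 * 1_000;
const SPAN_MS: i64 = 120 * HOUR_MS; // timestamps span 0..120h
const BOUNDS_MIN_MS: i64 = HOUR_MS; // drops the first hour
const BOUNDS_MAX_MS: i64 = 119 * HOUR_MS; // drops the last hour

/// Hypothetical helper: inclusive bounds check (assumed semantics).
fn in_bounds(ts_ms: i64) -> bool {
    (BOUNDS_MIN_MS..=BOUNDS_MAX_MS).contains(&ts_ms)
}

/// Spreads `num_docs` timestamps evenly over the span (no noise)
/// and counts how many of them fall inside the bounds.
fn count_in_bounds(num_docs: i64) -> i64 {
    (0..num_docs)
        .map(|i| (i * SPAN_MS) / num_docs)
        .filter(|&ts_ms| in_bounds(ts_ms))
        .count() as i64
}
```

Calling `count_in_bounds(1_200)` places one document every 6 minutes, so only the documents in the first and last hour are rejected, which matches the comment's claim that almost every document is in bounds.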
@@ -1,249 +0,0 @@
|
||||
// Benchmarks boolean conjunction queries using binggan.
|
||||
//
|
||||
// What’s measured:
|
||||
// - Or and And queries with varying selectivity (only `Term` queries for now on leafs)
|
||||
// - Nested AND/OR combinations (on multiple fields)
|
||||
// - No-scoring path using the Count collector (focus on iterator/skip performance)
|
||||
// - Top-K retrieval (k=10) using the TopDocs collector
|
||||
//
|
||||
// Corpus model:
|
||||
// - Synthetic docs; each token a/b/c is independently included per doc
|
||||
// - If none of a/b/c are included, emit a neutral filler token to keep doc length similar
|
||||
//
|
||||
// Notes:
|
||||
// - After optimization, when scoring is disabled Tantivy reads doc-only postings
|
||||
// (IndexRecordOption::Basic), avoiding frequency decoding overhead.
|
||||
// - This bench isolates boolean iteration speed and intersection/union cost.
|
||||
// - Use `cargo bench --bench boolean_conjunction` to run.
|
||||
|
||||
use binggan::{black_box, BenchGroup, BenchRunner};
|
||||
use rand::prelude::*;
|
||||
use rand::rngs::StdRng;
|
||||
use rand::SeedableRng;
|
||||
use tantivy::collector::sort_key::SortByStaticFastValue;
|
||||
use tantivy::collector::{Collector, Count, TopDocs};
|
||||
use tantivy::query::QueryParser;
|
||||
use tantivy::schema::{Schema, FAST, TEXT};
|
||||
use tantivy::{doc, Index, Order, ReloadPolicy, Searcher};
|
||||
|
||||
#[derive(Clone)]
|
||||
struct BenchIndex {
|
||||
#[allow(dead_code)]
|
||||
index: Index,
|
||||
searcher: Searcher,
|
||||
query_parser: QueryParser,
|
||||
}
|
||||
|
||||
/// Build a single index containing both fields (title, body) and
|
||||
/// return two BenchIndex views:
|
||||
/// - single_field: QueryParser defaults to only "body"
|
||||
/// - multi_field: QueryParser defaults to ["title", "body"]
|
||||
fn build_index(num_docs: usize, terms: &[(&str, f32)]) -> (BenchIndex, BenchIndex) {
|
||||
// Unified schema (two text fields)
|
||||
let mut schema_builder = Schema::builder();
|
||||
let f_title = schema_builder.add_text_field("title", TEXT);
|
||||
let f_body = schema_builder.add_text_field("body", TEXT);
|
||||
let f_score = schema_builder.add_u64_field("score", FAST);
|
||||
let f_score2 = schema_builder.add_u64_field("score2", FAST);
|
||||
let schema = schema_builder.build();
|
||||
let index = Index::create_in_ram(schema.clone());
|
||||
|
||||
// Populate index with stable RNG for reproducibility.
|
||||
let mut rng = StdRng::from_seed([7u8; 32]);
|
||||
|
||||
// Populate: spread each present token 90/10 to body/title
|
||||
{
|
||||
let mut writer = index.writer_with_num_threads(1, 500_000_000).unwrap();
|
||||
for _ in 0..num_docs {
|
||||
let score = rng.random_range(0u64..100u64);
|
||||
let score2 = rng.random_range(0u64..100_000u64);
|
||||
let mut title_tokens: Vec<&str> = Vec::new();
|
||||
let mut body_tokens: Vec<&str> = Vec::new();
|
||||
for &(tok, prob) in terms {
|
||||
if rng.random_bool(prob as f64) {
|
||||
if rng.random_bool(0.1) {
|
||||
title_tokens.push(tok);
|
||||
} else {
|
||||
body_tokens.push(tok);
|
||||
}
|
||||
}
|
||||
}
|
||||
if title_tokens.is_empty() && body_tokens.is_empty() {
|
||||
body_tokens.push("z");
|
||||
}
|
||||
writer
|
||||
.add_document(doc!(
|
||||
f_title=>title_tokens.join(" "),
|
||||
f_body=>body_tokens.join(" "),
|
||||
f_score=>score,
|
||||
f_score2=>score2,
|
||||
))
|
||||
.unwrap();
|
||||
}
|
||||
writer.commit().unwrap();
|
||||
}
|
||||
|
||||
// Prepare reader/searcher once.
|
||||
let reader = index
|
||||
.reader_builder()
|
||||
.reload_policy(ReloadPolicy::Manual)
|
||||
.try_into()
|
||||
.unwrap();
|
||||
let searcher = reader.searcher();
|
||||
|
||||
// Build two query parsers with different default fields.
|
||||
let qp_single = QueryParser::for_index(&index, vec![f_body]);
|
||||
let qp_multi = QueryParser::for_index(&index, vec![f_title, f_body]);
|
||||
|
||||
let only_title = BenchIndex {
|
||||
index: index.clone(),
|
||||
searcher: searcher.clone(),
|
||||
query_parser: qp_single,
|
||||
};
|
||||
let title_and_body = BenchIndex {
|
||||
index,
|
||||
searcher,
|
||||
query_parser: qp_multi,
|
||||
};
|
||||
(only_title, title_and_body)
|
||||
}
|
||||
|
||||
fn format_pct(p: f32) -> String {
|
||||
let pct = (p as f64) * 100.0;
|
||||
let rounded = (pct * 1_000_000.0).round() / 1_000_000.0;
|
||||
if rounded.fract() <= 0.001 {
|
||||
format!("{}%", rounded as u64)
|
||||
} else {
|
||||
format!("{}%", rounded)
|
||||
}
|
||||
}
|
||||
|
||||
fn query_label(query_str: &str, term_pcts: &[(&str, String)]) -> String {
|
||||
let mut label = query_str.to_string();
|
||||
for (term, pct) in term_pcts {
|
||||
label = label.replace(term, pct);
|
||||
}
|
||||
label.replace(' ', "_")
|
||||
}
|
||||
|
||||
fn main() {
    // terms with varying selectivity, ordered from rarest to most common.
    // With 1M docs, we expect:
    // a: 0.01% (100), b: 1% (10k), c: 5% (50k), d: 15% (150k), e: 30% (300k)
    let num_docs = 1_000_000;
    let terms: &[(&str, f32)] = &[
        ("a", 0.0001),
        ("b", 0.01),
        ("c", 0.05),
        ("d", 0.15),
        ("e", 0.30),
    ];

    let queries: &[(&str, &[&str])] = &[
        (
            "only_union",
            &["c OR b", "c OR b OR d", "c OR e", "e OR a"] as &[&str],
        ),
        (
            "only_intersection",
            &["+c +b", "+c +b +d", "+c +e", "+e +a"] as &[&str],
        ),
        (
            "union_intersection",
            &["+c +(b OR d)", "+e +(c OR a)", "+(c OR b) +(d OR e)"] as &[&str],
        ),
    ];

    let mut runner = BenchRunner::new();
    let (only_title, title_and_body) = build_index(num_docs, terms);
    let term_pcts: Vec<(&str, String)> = terms
        .iter()
        .map(|&(term, p)| (term, format_pct(p)))
        .collect();

    for (view_name, bench_index) in [
        ("single_field", only_title),
        ("multi_field", title_and_body),
    ] {
        for (category_name, category_queries) in queries {
            for query_str in *category_queries {
                let mut group = runner.new_group();
                let query_label = query_label(query_str, &term_pcts);
                group.set_name(format!("{}_{}_{}", view_name, category_name, query_label));
                add_bench_task(&mut group, &bench_index, query_str, Count, "count");
                add_bench_task(
                    &mut group,
                    &bench_index,
                    query_str,
                    TopDocs::with_limit(10).order_by_score(),
                    "top10_inv_idx",
                );
                add_bench_task(
                    &mut group,
                    &bench_index,
                    query_str,
                    (Count, TopDocs::with_limit(10).order_by_score()),
                    "count+top10",
                );

                add_bench_task(
                    &mut group,
                    &bench_index,
                    query_str,
                    TopDocs::with_limit(10).order_by_fast_field::<u64>("score", Order::Asc),
                    "top10_by_ff",
                );
                add_bench_task(
                    &mut group,
                    &bench_index,
                    query_str,
                    TopDocs::with_limit(10).order_by((
                        SortByStaticFastValue::<u64>::for_field("score"),
                        SortByStaticFastValue::<u64>::for_field("score2"),
                    )),
                    "top10_by_2ff",
                );

                group.run();
            }
        }
    }
}

trait FruitCount {
    fn count(&self) -> usize;
}

impl FruitCount for usize {
    fn count(&self) -> usize {
        *self
    }
}

impl<T> FruitCount for Vec<T> {
    fn count(&self) -> usize {
        self.len()
    }
}

impl<A: FruitCount, B> FruitCount for (A, B) {
    fn count(&self) -> usize {
        self.0.count()
    }
}

fn add_bench_task<C: Collector + 'static>(
    bench_group: &mut BenchGroup,
    bench_index: &BenchIndex,
    query_str: &str,
    collector: C,
    collector_name: &str,
) where
    C::Fruit: FruitCount,
{
    let query = bench_index.query_parser.parse_query(query_str).unwrap();
    let searcher = bench_index.searcher.clone();
    bench_group.register(collector_name.to_string(), move |_| {
        black_box(searcher.search(&query, &collector).unwrap().count())
    });
}
@@ -1,288 +0,0 @@
use binggan::{black_box, BenchGroup, BenchRunner};
use rand::prelude::*;
use rand::rngs::StdRng;
use rand::SeedableRng;
use tantivy::collector::{Collector, Count, DocSetCollector, TopDocs};
use tantivy::query::{Query, QueryParser};
use tantivy::schema::{Schema, FAST, INDEXED, TEXT};
use tantivy::{doc, Index, Order, ReloadPolicy, Searcher};

#[derive(Clone)]
struct BenchIndex {
    #[allow(dead_code)]
    index: Index,
    searcher: Searcher,
    query_parser: QueryParser,
}

fn build_shared_indices(num_docs: usize, p_title_a: f32, distribution: &str) -> BenchIndex {
    // Unified schema
    let mut schema_builder = Schema::builder();
    let f_title = schema_builder.add_text_field("title", TEXT);
    let f_num_rand = schema_builder.add_u64_field("num_rand", INDEXED);
    let f_num_asc = schema_builder.add_u64_field("num_asc", INDEXED);
    let f_num_rand_fast = schema_builder.add_u64_field("num_rand_fast", INDEXED | FAST);
    let f_num_asc_fast = schema_builder.add_u64_field("num_asc_fast", INDEXED | FAST);
    let schema = schema_builder.build();
    let index = Index::create_in_ram(schema.clone());

    // Populate index with stable RNG for reproducibility.
    let mut rng = StdRng::from_seed([7u8; 32]);

    {
        let mut writer = index.writer_with_num_threads(1, 4_000_000_000).unwrap();

        match distribution {
            "dense" => {
                for doc_id in 0..num_docs {
                    // Always add title to avoid empty documents
                    let title_token = if rng.random_bool(p_title_a as f64) {
                        "a"
                    } else {
                        "b"
                    };

                    let num_rand = rng.random_range(0u64..1000u64);

                    let num_asc = (doc_id / 10000) as u64;

                    writer
                        .add_document(doc!(
                            f_title=>title_token,
                            f_num_rand=>num_rand,
                            f_num_asc=>num_asc,
                            f_num_rand_fast=>num_rand,
                            f_num_asc_fast=>num_asc,
                        ))
                        .unwrap();
                }
            }
            "sparse" => {
                for doc_id in 0..num_docs {
                    // Always add title to avoid empty documents
                    let title_token = if rng.random_bool(p_title_a as f64) {
                        "a"
                    } else {
                        "b"
                    };

                    let num_rand = rng.random_range(0u64..10000000u64);

                    let num_asc = doc_id as u64;

                    writer
                        .add_document(doc!(
                            f_title=>title_token,
                            f_num_rand=>num_rand,
                            f_num_asc=>num_asc,
                            f_num_rand_fast=>num_rand,
                            f_num_asc_fast=>num_asc,
                        ))
                        .unwrap();
                }
            }
            _ => {
                panic!("Unsupported distribution type");
            }
        }
        writer.commit().unwrap();
    }

    // Prepare reader/searcher once.
    let reader = index
        .reader_builder()
        .reload_policy(ReloadPolicy::Manual)
        .try_into()
        .unwrap();
    let searcher = reader.searcher();

    // Build query parser for title field
    let qp_title = QueryParser::for_index(&index, vec![f_title]);

    BenchIndex {
        index,
        searcher,
        query_parser: qp_title,
    }
}

fn main() {
    // Prepare corpora with varying scenarios
    let scenarios = vec![
        (
            "dense and 99% a".to_string(),
            10_000_000,
            0.99,
            "dense",
            0,
            9,
        ),
        (
            "dense and 99% a".to_string(),
            10_000_000,
            0.99,
            "dense",
            990,
            999,
        ),
        (
            "sparse and 99% a".to_string(),
            10_000_000,
            0.99,
            "sparse",
            0,
            9,
        ),
        (
            "sparse and 99% a".to_string(),
            10_000_000,
            0.99,
            "sparse",
            9_999_990,
            9_999_999,
        ),
    ];

    let mut runner = BenchRunner::new();
    for (scenario_id, n, p_title_a, num_rand_distribution, range_low, range_high) in scenarios {
        // Build index for this scenario
        let bench_index = build_shared_indices(n, p_title_a, num_rand_distribution);

        // Create benchmark group
        let mut group = runner.new_group();

        // Now set the name (this moves scenario_id)
        group.set_name(scenario_id);

        // Define all four field types
        let field_names = ["num_rand", "num_asc", "num_rand_fast", "num_asc_fast"];

        // Define the three terms we want to test with
        let terms = ["a", "b", "z"];

        // Generate all combinations of terms and field names
        let mut queries = Vec::new();
        for &term in &terms {
            for &field_name in &field_names {
                let query_str = format!(
                    "{} AND {}:[{} TO {}]",
                    term, field_name, range_low, range_high
                );
                queries.push((query_str, field_name.to_string()));
            }
        }

        let query_str = format!(
            "{}:[{} TO {}] AND {}:[{} TO {}]",
            "num_rand_fast", range_low, range_high, "num_asc_fast", range_low, range_high
        );
        queries.push((query_str, "num_asc_fast".to_string()));

        // Run all benchmark tasks for each query and its corresponding field name
        for (query_str, field_name) in queries {
            run_benchmark_tasks(&mut group, &bench_index, &query_str, &field_name);
        }

        group.run();
    }
}

/// Run all benchmark tasks for a given query string and field name
fn run_benchmark_tasks(
    bench_group: &mut BenchGroup,
    bench_index: &BenchIndex,
    query_str: &str,
    field_name: &str,
) {
    // Test count
    add_bench_task(bench_group, bench_index, query_str, Count, "count");

    // Test all results
    add_bench_task(
        bench_group,
        bench_index,
        query_str,
        DocSetCollector,
        "all results",
    );

    // Test top 100 by the field (if it's a FAST field)
    if field_name.ends_with("_fast") {
        // Ascending order
        {
            let collector_name = format!("top100_by_{}_asc", field_name);
            let field_name_owned = field_name.to_string();
            add_bench_task(
                bench_group,
                bench_index,
                query_str,
                TopDocs::with_limit(100).order_by_fast_field::<u64>(field_name_owned, Order::Asc),
                &collector_name,
            );
        }

        // Descending order
        {
            let collector_name = format!("top100_by_{}_desc", field_name);
            let field_name_owned = field_name.to_string();
            add_bench_task(
                bench_group,
                bench_index,
                query_str,
                TopDocs::with_limit(100).order_by_fast_field::<u64>(field_name_owned, Order::Desc),
                &collector_name,
            );
        }
    }
}

fn add_bench_task<C: Collector + 'static>(
    bench_group: &mut BenchGroup,
    bench_index: &BenchIndex,
    query_str: &str,
    collector: C,
    collector_name: &str,
) {
    let task_name = format!("{}_{}", query_str.replace(" ", "_"), collector_name);
    let query = bench_index.query_parser.parse_query(query_str).unwrap();
    let search_task = SearchTask {
        searcher: bench_index.searcher.clone(),
        collector,
        query,
    };
    bench_group.register(task_name, move |_| black_box(search_task.run()));
}

struct SearchTask<C: Collector> {
    searcher: Searcher,
    collector: C,
    query: Box<dyn Query>,
}

impl<C: Collector> SearchTask<C> {
    #[inline(never)]
    pub fn run(&self) -> usize {
        let result = self.searcher.search(&self.query, &self.collector).unwrap();
        if let Some(count) = (&result as &dyn std::any::Any).downcast_ref::<usize>() {
            *count
        } else if let Some(top_docs) = (&result as &dyn std::any::Any)
            .downcast_ref::<Vec<(Option<u64>, tantivy::DocAddress)>>()
        {
            top_docs.len()
        } else if let Some(top_docs) =
            (&result as &dyn std::any::Any).downcast_ref::<Vec<(u64, tantivy::DocAddress)>>()
        {
            top_docs.len()
        } else if let Some(doc_set) = (&result as &dyn std::any::Any)
            .downcast_ref::<std::collections::HashSet<tantivy::DocAddress>>()
        {
            doc_set.len()
        } else {
            eprintln!(
                "Unknown collector result type: {:?}",
                std::any::type_name::<C::Fruit>()
            );
            0
        }
    }
}
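The downcast chain in `SearchTask::run` could plausibly be expressed with a small trait instead, much like the `FruitCount` trait used by the other benchmark in this diff. The following is a minimal std-only sketch under that assumption; `ResultLen` and `hits` are hypothetical names and are not part of tantivy or of the original file.

```rust
use std::collections::HashSet;

/// Hypothetical helper trait: reports how many hits a collector result holds.
trait ResultLen {
    fn result_len(&self) -> usize;
}

impl ResultLen for usize {
    fn result_len(&self) -> usize {
        *self
    }
}

impl<T> ResultLen for Vec<T> {
    fn result_len(&self) -> usize {
        self.len()
    }
}

impl<T> ResultLen for HashSet<T> {
    fn result_len(&self) -> usize {
        self.len()
    }
}

/// Generic over any result type implementing the trait. An unsupported
/// result type becomes a compile error rather than a runtime fallback to 0.
fn hits<R: ResultLen>(result: &R) -> usize {
    result.result_len()
}
```

The trade-off in this sketch is that every result type must be covered by an `impl` up front, whereas the `Any`-based version compiles for any collector and only reports unknown types at run time.
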
@@ -1,69 +0,0 @@
use binggan::plugins::PeakMemAllocPlugin;
use binggan::{black_box, InputGroup, PeakMemAlloc, INSTRUMENTED_SYSTEM};
use serde_json::json;
use tantivy::collector::Count;
use tantivy::query::ExistsQuery;
use tantivy::schema::{Schema, FAST, TEXT};
use tantivy::{doc, Index};

#[global_allocator]
pub static GLOBAL: &PeakMemAlloc<std::alloc::System> = &INSTRUMENTED_SYSTEM;

fn main() {
    let doc_count: usize = 500_000;
    let subfield_counts: &[usize] = &[1, 2, 3, 4, 5, 6, 7, 8, 16, 256, 4096, 65536, 262144];

    let indices: Vec<(String, Index)> = subfield_counts
        .iter()
        .map(|&sub_fields| {
            (
                format!("subfields={sub_fields}"),
                build_index_with_json_subfields(doc_count, sub_fields),
            )
        })
        .collect();

    let mut group = InputGroup::new_with_inputs(indices);
    group.add_plugin(PeakMemAllocPlugin::new(GLOBAL));

    group.config().num_iter_group = Some(1);
    group.config().num_iter_bench = Some(1);
    group.register("exists_json", exists_json_union);

    group.run();
}

fn exists_json_union(index: &Index) {
    let reader = index.reader().expect("reader");
    let searcher = reader.searcher();
    let query = ExistsQuery::new("json".to_string(), true);
    let count = searcher.search(&query, &Count).expect("exists search");
    // Prevents optimizer from eliding the search
    black_box(count);
}

fn build_index_with_json_subfields(num_docs: usize, num_subfields: usize) -> Index {
    // Schema: single JSON field stored as FAST to support ExistsQuery.
    let mut schema_builder = Schema::builder();
    let json_field = schema_builder.add_json_field("json", TEXT | FAST);
    let schema = schema_builder.build();

    let index = Index::create_from_tempdir(schema).expect("create index");
    {
        let mut index_writer = index
            .writer_with_num_threads(1, 200_000_000)
            .expect("writer");
        for i in 0..num_docs {
            let sub = i % num_subfields;
            // Only one subpath set per document; rotate subpaths so that
            // no single subpath is full, but the union covers all docs.
            let v = json!({ format!("field_{sub}"): i as u64 });
            index_writer
                .add_document(doc!(json_field => v))
                .expect("add_document");
        }
        index_writer.commit().expect("commit");
    }

    index
}
@@ -1,149 +0,0 @@
// Benchmarks top-K intersection of term scorers (block_wand_intersection).
//
// What's measured:
// - Conjunctive queries (+a +b, +a +b +c) with top-10 by score
// - Varying doc-frequency balance between terms (balanced, skewed, very skewed)
// - Realistic term frequencies (geometric distribution, mostly low)
// - 1M-doc single segment
//
// Run with: cargo bench --bench intersection_bench

use binggan::{black_box, BenchRunner};
use rand::prelude::*;
use rand::rngs::StdRng;
use rand::SeedableRng;
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::{Schema, TEXT};
use tantivy::{doc, Index, ReloadPolicy, Searcher};

const NUM_DOCS: usize = 1_000_000;

struct BenchIndex {
    searcher: Searcher,
    query_parser: QueryParser,
}

/// Generate term frequency from a geometric-like distribution.
/// Most values are 1, a few are 2-3, rarely higher.
/// p controls the decay: higher p → more weight on tf=1.
fn random_term_freq(rng: &mut StdRng, p: f64) -> u32 {
    let mut tf = 1u32;
    while tf < 10 && rng.random_bool(1.0 - p) {
        tf += 1;
    }
    tf
}

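To make the shape of this distribution concrete, here is a small std-only sketch that computes the exact probabilities implied by the loop in `random_term_freq`: P(tf = k) = p(1 - p)^(k-1) for k from 1 to 9, with the cap at 10 absorbing the remaining (1 - p)^9. The helper names `term_freq_pmf` and `expected_term_freq` are hypothetical and do not appear in the benchmark.

```rust
/// Probability mass function of the term frequency produced by a loop of
/// the form `while tf < 10 && random_bool(1.0 - p) { tf += 1 }`.
/// Index `k - 1` holds P(tf = k) for k in 1..=10.
fn term_freq_pmf(p: f64) -> [f64; 10] {
    let q = 1.0 - p; // probability of adding one more occurrence
    let mut pmf = [0.0; 10];
    let mut q_pow = 1.0; // q^(k-1) for the current k
    for slot in pmf.iter_mut().take(9) {
        *slot = p * q_pow;
        q_pow *= q;
    }
    // The cap at 10 absorbs all the remaining mass, q^9.
    pmf[9] = q_pow;
    pmf
}

/// Expected term frequency under the distribution above.
fn expected_term_freq(p: f64) -> f64 {
    term_freq_pmf(p)
        .iter()
        .enumerate()
        .map(|(i, prob)| (i as f64 + 1.0) * prob)
        .sum()
}
```

For the value `p = 0.7` used by the benchmark, this gives roughly 70% of occurrences with tf = 1, 21% with tf = 2, and a mean of about 1.43, which matches the "mostly low" description in the doc comment.
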
/// Build an index with three terms (`aaa`, `bbb`, `ccc`) with given doc-frequency probabilities.
/// Each term occurrence has a realistic term frequency (geometric distribution).
/// Field length is padded with filler tokens to create varied fieldnorms.
fn build_index(p_a: f64, p_b: f64, p_c: f64) -> BenchIndex {
    let mut schema_builder = Schema::builder();
    let body = schema_builder.add_text_field("body", TEXT);
    let schema = schema_builder.build();
    let index = Index::create_in_ram(schema);

    let mut rng = StdRng::from_seed([42u8; 32]);

    {
        let mut writer = index.writer_with_num_threads(1, 500_000_000).unwrap();
        for _ in 0..NUM_DOCS {
            let mut tokens: Vec<String> = Vec::new();

            if rng.random_bool(p_a) {
                let tf = random_term_freq(&mut rng, 0.7);
                for _ in 0..tf {
                    tokens.push("aaa".to_string());
                }
            }
            if rng.random_bool(p_b) {
                let tf = random_term_freq(&mut rng, 0.7);
                for _ in 0..tf {
                    tokens.push("bbb".to_string());
                }
            }
            if rng.random_bool(p_c) {
                let tf = random_term_freq(&mut rng, 0.7);
                for _ in 0..tf {
                    tokens.push("ccc".to_string());
                }
            }

            // Pad with filler to create varied field lengths (5-29 filler tokens;
            // the upper bound of the range is exclusive).
            let filler_count = rng.random_range(5u32..30u32);
            for _ in 0..filler_count {
                tokens.push("filler".to_string());
            }

            let text = tokens.join(" ");
            writer.add_document(doc!(body => text)).unwrap();
        }
        writer.commit().unwrap();
    }

    let reader = index
        .reader_builder()
        .reload_policy(ReloadPolicy::Manual)
        .try_into()
        .unwrap();
    let searcher = reader.searcher();
    let query_parser = QueryParser::for_index(&index, vec![body]);

    BenchIndex {
        searcher,
        query_parser,
    }
}

fn main() {
    // Scenarios: (label, p_a, p_b, p_c)
    //
    // "balanced": all terms ~10% → intersection ~1% of docs
    // "skewed": one common (50%), one rare (2%) → intersection ~1%
    // "very_skewed": one very common (80%), one very rare (0.5%) → intersection ~0.4%
    // "three_balanced": three terms ~20% each → intersection ~0.8%
    // "three_skewed": 50% / 10% / 2% → intersection ~0.1%
    let scenarios: Vec<(&str, f64, f64, f64)> = vec![
        ("balanced_10%_10%", 0.10, 0.10, 0.0),
        ("skewed_50%_2%", 0.50, 0.02, 0.0),
        ("very_skewed_80%_0.5%", 0.80, 0.005, 0.0),
        ("three_balanced_20%_20%_20%", 0.20, 0.20, 0.20),
        ("three_skewed_50%_10%_2%", 0.50, 0.10, 0.02),
    ];

    let mut runner = BenchRunner::new();

    for (label, p_a, p_b, p_c) in &scenarios {
        let bench_index = build_index(*p_a, *p_b, *p_c);

        let mut group = runner.new_group();
        group.set_name(format!("intersection — {label}"));

        // Two-term intersection
        if *p_a > 0.0 && *p_b > 0.0 {
            let query_str = "+aaa +bbb";
            let query = bench_index.query_parser.parse_query(query_str).unwrap();
            let searcher = bench_index.searcher.clone();
            group.register(format!("{query_str} top10"), move |_| {
                let collector = TopDocs::with_limit(10).order_by_score();
                black_box(searcher.search(&query, &collector).unwrap());
                1usize
            });
        }

        // Three-term intersection
        if *p_c > 0.0 {
            let query_str = "+aaa +bbb +ccc";
            let query = bench_index.query_parser.parse_query(query_str).unwrap();
            let searcher = bench_index.searcher.clone();
            group.register(format!("{query_str} top10"), move |_| {
                let collector = TopDocs::with_limit(10).order_by_score();
                black_box(searcher.search(&query, &collector).unwrap());
                1usize
            });
        }

        group.run();
    }
}
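The intersection sizes quoted in the scenario comments of `main` follow from multiplying the per-term probabilities, since `build_index` draws each term's presence independently. A minimal sketch of that arithmetic, with hypothetical helper names that are not part of the benchmark:

```rust
/// Expected fraction of documents matching all terms, assuming each term
/// is present independently with the given probability.
fn expected_intersection_fraction(term_probs: &[f64]) -> f64 {
    term_probs.iter().product()
}

/// Expected number of matching documents in a corpus of `num_docs`.
fn expected_intersection_docs(term_probs: &[f64], num_docs: usize) -> f64 {
    expected_intersection_fraction(term_probs) * num_docs as f64
}
```

For example, the `three_skewed` scenario (50%, 10%, 2%) gives a fraction of 0.001, or about 1,000 of the 1M documents; actual counts in a generated index will vary slightly around these expectations.
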
@@ -1,224 +0,0 @@
// Benchmarks segment merging
//
// Notes:
// - Input segments are kept intact (no deletes / no IndexWriter merge).
// - Output is written to a `NullDirectory` that discards all files except
//   fieldnorms (needed for merging).

use std::collections::HashMap;
use std::io::{self, Write};
use std::path::{Path, PathBuf};
use std::sync::{Arc, RwLock};

use binggan::{black_box, BenchRunner};
use rand::prelude::*;
use rand::rngs::StdRng;
use rand::SeedableRng;
use tantivy::directory::error::{DeleteError, OpenReadError, OpenWriteError};
use tantivy::directory::{
    AntiCallToken, Directory, FileHandle, OwnedBytes, TerminatingWrite, WatchCallback, WatchHandle,
    WritePtr,
};
use tantivy::indexer::{merge_filtered_segments, NoMergePolicy};
use tantivy::schema::{Schema, TEXT};
use tantivy::{doc, HasLen, Index, IndexSettings, Segment};

#[derive(Clone, Default, Debug)]
struct NullDirectory {
    blobs: Arc<RwLock<HashMap<PathBuf, OwnedBytes>>>,
}

struct NullWriter;

impl Write for NullWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

impl TerminatingWrite for NullWriter {
    fn terminate_ref(&mut self, _token: AntiCallToken) -> io::Result<()> {
        Ok(())
    }
}

struct InMemoryWriter {
    path: PathBuf,
    buffer: Vec<u8>,
    blobs: Arc<RwLock<HashMap<PathBuf, OwnedBytes>>>,
}

impl Write for InMemoryWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.buffer.extend_from_slice(buf);
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

impl TerminatingWrite for InMemoryWriter {
    fn terminate_ref(&mut self, _token: AntiCallToken) -> io::Result<()> {
        let bytes = OwnedBytes::new(std::mem::take(&mut self.buffer));
        self.blobs.write().unwrap().insert(self.path.clone(), bytes);
        Ok(())
    }
}

#[derive(Debug, Default)]
struct NullFileHandle;
impl HasLen for NullFileHandle {
    fn len(&self) -> usize {
        0
    }
}
impl FileHandle for NullFileHandle {
    fn read_bytes(&self, _range: std::ops::Range<usize>) -> io::Result<OwnedBytes> {
        unimplemented!()
    }
}

impl Directory for NullDirectory {
    fn get_file_handle(&self, path: &Path) -> Result<Arc<dyn FileHandle>, OpenReadError> {
        if let Some(bytes) = self.blobs.read().unwrap().get(path) {
            return Ok(Arc::new(bytes.clone()));
        }
        Ok(Arc::new(NullFileHandle))
    }

    fn delete(&self, _path: &Path) -> Result<(), DeleteError> {
        Ok(())
    }

    fn exists(&self, _path: &Path) -> Result<bool, OpenReadError> {
        Ok(true)
    }

    fn open_write(&self, path: &Path) -> Result<WritePtr, OpenWriteError> {
        let path_buf = path.to_path_buf();
        if path.to_string_lossy().ends_with(".fieldnorm") {
            let writer = InMemoryWriter {
                path: path_buf,
                buffer: Vec::new(),
                blobs: Arc::clone(&self.blobs),
            };
            Ok(io::BufWriter::new(Box::new(writer)))
        } else {
            Ok(io::BufWriter::new(Box::new(NullWriter)))
        }
    }

    fn atomic_read(&self, path: &Path) -> Result<Vec<u8>, OpenReadError> {
        if let Some(bytes) = self.blobs.read().unwrap().get(path) {
            return Ok(bytes.as_slice().to_vec());
        }
        Err(OpenReadError::FileDoesNotExist(path.to_path_buf()))
    }

    fn atomic_write(&self, _path: &Path, _data: &[u8]) -> io::Result<()> {
        Ok(())
    }

    fn sync_directory(&self) -> io::Result<()> {
        Ok(())
    }

    fn watch(&self, _watch_callback: WatchCallback) -> tantivy::Result<WatchHandle> {
        Ok(WatchHandle::empty())
    }
}

struct MergeScenario {
    #[allow(dead_code)]
    index: Index,
    segments: Vec<Segment>,
    settings: IndexSettings,
    label: String,
}

fn build_index(
    num_segments: usize,
    docs_per_segment: usize,
    tokens_per_doc: usize,
    vocab_size: usize,
) -> MergeScenario {
    let mut schema_builder = Schema::builder();
    let body = schema_builder.add_text_field("body", TEXT);
    let schema = schema_builder.build();
    let index = Index::create_in_ram(schema.clone());

    assert!(vocab_size > 0);
    let total_tokens = num_segments * docs_per_segment * tokens_per_doc;
    let use_unique_terms = vocab_size >= total_tokens;
    let mut rng = StdRng::from_seed([7u8; 32]);
    let mut next_token_id: u64 = 0;

    {
        let mut writer = index.writer_with_num_threads(1, 256_000_000).unwrap();
        writer.set_merge_policy(Box::new(NoMergePolicy));
        for _ in 0..num_segments {
            for _ in 0..docs_per_segment {
                let mut tokens = Vec::with_capacity(tokens_per_doc);
                for _ in 0..tokens_per_doc {
                    let token_id = if use_unique_terms {
                        let id = next_token_id;
                        next_token_id += 1;
                        id
                    } else {
                        rng.random_range(0..vocab_size as u64)
                    };
                    tokens.push(format!("term_{token_id}"));
                }
                writer.add_document(doc!(body => tokens.join(" "))).unwrap();
            }
            writer.commit().unwrap();
        }
    }

    let segments = index.searchable_segments().unwrap();
    let settings = index.settings().clone();
    let label = format!(
        "segments={}, docs/seg={}, tokens/doc={}, vocab={}",
        num_segments, docs_per_segment, tokens_per_doc, vocab_size
    );

    MergeScenario {
        index,
        segments,
        settings,
        label,
    }
}

fn main() {
    let scenarios = vec![
        build_index(8, 50_000, 12, 8),
        build_index(16, 50_000, 12, 8),
        build_index(16, 100_000, 12, 8),
        build_index(8, 50_000, 8, 8 * 50_000 * 8),
    ];

    let mut runner = BenchRunner::new();
    for scenario in scenarios {
        let mut group = runner.new_group();
        group.set_name(format!("merge_segments inv_index — {}", scenario.label));
        let segments = scenario.segments.clone();
        let settings = scenario.settings.clone();
        group.register("merge", move |_| {
            let output_dir = NullDirectory::default();
            let filter_doc_ids = vec![None; segments.len()];
            let merged_index =
                merge_filtered_segments(&segments, settings.clone(), filter_doc_ids, output_dir)
                    .unwrap();
            black_box(merged_index);
        });

        group.run();
    }
}
@@ -1,35 +0,0 @@
// Benchmark for the query grammar parsing deeply nested queries.
//
// Regression guard for https://github.com/quickwit-oss/tantivy/issues/2498:
// at depth 20/21 the old parser took 0.87 s / 1.72 s respectively because
// `ast()` retried `occur_leaf` on backtrack, giving O(2^n) time. With the
// fix parsing is linear and completes in microseconds.
//
// Run with: `cargo bench --bench query_parser_nested`.

use binggan::{black_box, BenchRunner};
use tantivy::query_grammar::parse_query;

fn nested_query(depth: usize, leading_plus: bool) -> String {
    let leading = "(".repeat(depth);
    let trailing = ")".repeat(depth);
    let prefix = if leading_plus { "+" } else { "" };
    format!("{prefix}{leading}title:test{trailing}")
}

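For a quick sense of the inputs this benchmark feeds to the parser, the std-only sketch below repeats the `nested_query` construction so it can be run in isolation. It only illustrates the generated strings and makes no claim about parser timings.

```rust
/// Same construction as `nested_query`: wrap `title:test` in `depth`
/// pairs of parentheses, optionally prefixed with `+`.
fn nested_query(depth: usize, leading_plus: bool) -> String {
    let leading = "(".repeat(depth);
    let trailing = ")".repeat(depth);
    let prefix = if leading_plus { "+" } else { "" };
    format!("{prefix}{leading}title:test{trailing}")
}
```

For instance, `nested_query(2, true)` produces `+((title:test))`, and the depth-20 query without a leading plus is 50 characters long (20 opening parentheses, the 10-character leaf, 20 closing parentheses).
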
fn main() {
    let mut runner = BenchRunner::new();

    for depth in [20, 21] {
        for leading_plus in [false, true] {
            let query = nested_query(depth, leading_plus);
            let label = format!(
                "parse_nested_depth_{depth}_{}",
                if leading_plus { "plus" } else { "plain" },
            );
            runner.bench_function(&label, move |_| {
                black_box(parse_query(black_box(&query)).unwrap());
            });
        }
    }
}
@@ -1,365 +0,0 @@
use std::ops::Bound;

use binggan::{black_box, BenchGroup, BenchRunner};
use rand::prelude::*;
use rand::rngs::StdRng;
use rand::SeedableRng;
use tantivy::collector::{Count, DocSetCollector, TopDocs};
use tantivy::query::RangeQuery;
use tantivy::schema::{Schema, FAST, INDEXED};
use tantivy::{doc, Index, Order, ReloadPolicy, Searcher, Term};

#[derive(Clone)]
struct BenchIndex {
    #[allow(dead_code)]
    index: Index,
    searcher: Searcher,
}

fn build_shared_indices(num_docs: usize, distribution: &str) -> BenchIndex {
    // Schema with fast fields only
    let mut schema_builder = Schema::builder();
    let f_num_rand_fast = schema_builder.add_u64_field("num_rand_fast", INDEXED | FAST);
    let f_num_asc_fast = schema_builder.add_u64_field("num_asc_fast", INDEXED | FAST);
    let schema = schema_builder.build();
    let index = Index::create_in_ram(schema.clone());

    // Populate index with stable RNG for reproducibility.
    let mut rng = StdRng::from_seed([7u8; 32]);

    {
        let mut writer = index.writer_with_num_threads(1, 4_000_000_000).unwrap();

        match distribution {
            "dense" => {
                for doc_id in 0..num_docs {
                    let num_rand = rng.random_range(0u64..1000u64);
                    let num_asc = (doc_id / 10000) as u64;

                    writer
                        .add_document(doc!(
                            f_num_rand_fast=>num_rand,
                            f_num_asc_fast=>num_asc,
                        ))
                        .unwrap();
                }
            }
            "sparse" => {
                for doc_id in 0..num_docs {
                    let num_rand = rng.random_range(0u64..10000000u64);
                    let num_asc = doc_id as u64;

                    writer
                        .add_document(doc!(
                            f_num_rand_fast=>num_rand,
                            f_num_asc_fast=>num_asc,
                        ))
                        .unwrap();
                }
            }
            _ => {
                panic!("Unsupported distribution type");
            }
        }
        writer.commit().unwrap();
    }

    // Prepare reader/searcher once.
    let reader = index
        .reader_builder()
        .reload_policy(ReloadPolicy::Manual)
        .try_into()
        .unwrap();
    let searcher = reader.searcher();

    BenchIndex { index, searcher }
}

fn main() {
    // Prepare corpora with varying scenarios
    let scenarios = vec![
        // Dense distribution - random values in small range (0-999)
        (
            "dense_values_search_low_value_range".to_string(),
            10_000_000,
            "dense",
            0,
            9,
        ),
        (
            "dense_values_search_high_value_range".to_string(),
            10_000_000,
            "dense",
            990,
            999,
        ),
        (
            "dense_values_search_out_of_range".to_string(),
            10_000_000,
            "dense",
            1000,
            1002,
        ),
        (
            "sparse_values_search_low_value_range".to_string(),
            10_000_000,
            "sparse",
            0,
            9,
        ),
        (
            "sparse_values_search_high_value_range".to_string(),
            10_000_000,
            "sparse",
            9_999_990,
            9_999_999,
        ),
        (
            "sparse_values_search_out_of_range".to_string(),
            10_000_000,
            "sparse",
            10_000_000,
            10_000_002,
        ),
    ];

    let mut runner = BenchRunner::new();
    for (scenario_id, n, num_rand_distribution, range_low, range_high) in scenarios {
        // Build index for this scenario
        let bench_index = build_shared_indices(n, num_rand_distribution);

        // Create benchmark group
        let mut group = runner.new_group();

        // Now set the name (this moves scenario_id)
        group.set_name(scenario_id);

        // Define fast field types
        let field_names = ["num_rand_fast", "num_asc_fast"];

        // Generate range queries for fast fields
        for &field_name in &field_names {
            // Create the range query
            let field = bench_index.searcher.schema().get_field(field_name).unwrap();
            let lower_term = Term::from_field_u64(field, range_low);
            let upper_term = Term::from_field_u64(field, range_high);

            let query = RangeQuery::new(Bound::Included(lower_term), Bound::Included(upper_term));

            run_benchmark_tasks(
                &mut group,
                &bench_index,
                query,
                field_name,
                range_low,
                range_high,
            );
        }

        group.run();
    }
}

/// Run all benchmark tasks for a given range query and field name
fn run_benchmark_tasks(
    bench_group: &mut BenchGroup,
    bench_index: &BenchIndex,
    query: RangeQuery,
    field_name: &str,
    range_low: u64,
    range_high: u64,
) {
    // Test count
    add_bench_task_count(
        bench_group,
        bench_index,
        query.clone(),
        "count",
        field_name,
        range_low,
        range_high,
    );

    // Test top 100 by the field (ascending order)
    {
        let collector_name = format!("top100_by_{}_asc", field_name);
        let field_name_owned = field_name.to_string();
        add_bench_task_top100_asc(
            bench_group,
            bench_index,
            query.clone(),
            &collector_name,
            field_name,
            range_low,
            range_high,
            field_name_owned,
        );
    }

    // Test top 100 by the field (descending order)
    {
        let collector_name = format!("top100_by_{}_desc", field_name);
        let field_name_owned = field_name.to_string();
        add_bench_task_top100_desc(
            bench_group,
            bench_index,
            query,
            &collector_name,
            field_name,
            range_low,
            range_high,
            field_name_owned,
        );
    }
}

fn add_bench_task_count(
|
||||
bench_group: &mut BenchGroup,
|
||||
bench_index: &BenchIndex,
|
||||
query: RangeQuery,
|
||||
collector_name: &str,
|
||||
field_name: &str,
|
||||
range_low: u64,
|
||||
range_high: u64,
|
||||
) {
|
||||
let task_name = format!(
|
||||
"range_{}_[{} TO {}]_{}",
|
||||
field_name, range_low, range_high, collector_name
|
||||
);
|
||||
|
||||
let search_task = CountSearchTask {
|
||||
searcher: bench_index.searcher.clone(),
|
||||
query,
|
||||
};
|
||||
bench_group.register(task_name, move |_| black_box(search_task.run()));
|
||||
}
|
||||
|
||||
fn add_bench_task_docset(
|
||||
bench_group: &mut BenchGroup,
|
||||
bench_index: &BenchIndex,
|
||||
query: RangeQuery,
|
||||
collector_name: &str,
|
||||
field_name: &str,
|
||||
range_low: u64,
|
||||
range_high: u64,
|
||||
) {
|
||||
let task_name = format!(
|
||||
"range_{}_[{} TO {}]_{}",
|
||||
field_name, range_low, range_high, collector_name
|
||||
);
|
||||
|
||||
let search_task = DocSetSearchTask {
|
||||
searcher: bench_index.searcher.clone(),
|
||||
query,
|
||||
};
|
||||
bench_group.register(task_name, move |_| black_box(search_task.run()));
|
||||
}
|
||||
|
||||
fn add_bench_task_top100_asc(
|
||||
bench_group: &mut BenchGroup,
|
||||
bench_index: &BenchIndex,
|
||||
query: RangeQuery,
|
||||
collector_name: &str,
|
||||
field_name: &str,
|
||||
range_low: u64,
|
||||
range_high: u64,
|
||||
field_name_owned: String,
|
||||
) {
|
||||
let task_name = format!(
|
||||
"range_{}_[{} TO {}]_{}",
|
||||
field_name, range_low, range_high, collector_name
|
||||
);
|
||||
|
||||
let search_task = Top100AscSearchTask {
|
||||
searcher: bench_index.searcher.clone(),
|
||||
query,
|
||||
field_name: field_name_owned,
|
||||
};
|
||||
bench_group.register(task_name, move |_| black_box(search_task.run()));
|
||||
}
|
||||
|
||||
fn add_bench_task_top100_desc(
|
||||
bench_group: &mut BenchGroup,
|
||||
bench_index: &BenchIndex,
|
||||
query: RangeQuery,
|
||||
collector_name: &str,
|
||||
field_name: &str,
|
||||
range_low: u64,
|
||||
range_high: u64,
|
||||
field_name_owned: String,
|
||||
) {
|
||||
let task_name = format!(
|
||||
"range_{}_[{} TO {}]_{}",
|
||||
field_name, range_low, range_high, collector_name
|
||||
);
|
||||
|
||||
let search_task = Top100DescSearchTask {
|
||||
searcher: bench_index.searcher.clone(),
|
||||
query,
|
||||
field_name: field_name_owned,
|
||||
};
|
||||
bench_group.register(task_name, move |_| black_box(search_task.run()));
|
||||
}
|
||||
|
||||
struct CountSearchTask {
|
||||
searcher: Searcher,
|
||||
query: RangeQuery,
|
||||
}
|
||||
|
||||
impl CountSearchTask {
|
||||
#[inline(never)]
|
||||
pub fn run(&self) -> usize {
|
||||
self.searcher.search(&self.query, &Count).unwrap()
|
||||
}
|
||||
}
|
||||
|
||||
struct DocSetSearchTask {
|
||||
searcher: Searcher,
|
||||
query: RangeQuery,
|
||||
}
|
||||
|
||||
impl DocSetSearchTask {
|
||||
#[inline(never)]
|
||||
pub fn run(&self) -> usize {
|
||||
let result = self.searcher.search(&self.query, &DocSetCollector).unwrap();
|
||||
result.len()
|
||||
}
|
||||
}
|
||||
|
||||
struct Top100AscSearchTask {
|
||||
searcher: Searcher,
|
||||
query: RangeQuery,
|
||||
field_name: String,
|
||||
}
|
||||
|
||||
impl Top100AscSearchTask {
|
||||
#[inline(never)]
|
||||
pub fn run(&self) -> usize {
|
||||
let collector =
|
||||
TopDocs::with_limit(100).order_by_fast_field::<u64>(&self.field_name, Order::Asc);
|
||||
let result = self.searcher.search(&self.query, &collector).unwrap();
|
||||
for (_score, doc_address) in &result {
|
||||
let _doc: tantivy::TantivyDocument = self.searcher.doc(*doc_address).unwrap();
|
||||
}
|
||||
result.len()
|
||||
}
|
||||
}
|
||||
|
||||
struct Top100DescSearchTask {
|
||||
searcher: Searcher,
|
||||
query: RangeQuery,
|
||||
field_name: String,
|
||||
}
|
||||
|
||||
impl Top100DescSearchTask {
|
||||
#[inline(never)]
|
||||
pub fn run(&self) -> usize {
|
||||
let collector =
|
||||
TopDocs::with_limit(100).order_by_fast_field::<u64>(&self.field_name, Order::Desc);
|
||||
let result = self.searcher.search(&self.query, &collector).unwrap();
|
||||
for (_score, doc_address) in &result {
|
||||
let _doc: tantivy::TantivyDocument = self.searcher.doc(*doc_address).unwrap();
|
||||
}
|
||||
result.len()
|
||||
}
|
||||
}
|
||||
@@ -1,260 +0,0 @@
|
||||
use std::fmt::Display;
use std::net::Ipv6Addr;
use std::ops::RangeInclusive;

use binggan::plugins::PeakMemAllocPlugin;
use binggan::{black_box, BenchRunner, OutputValue, PeakMemAlloc, INSTRUMENTED_SYSTEM};
use columnar::MonotonicallyMappableToU128;
use rand::rngs::StdRng;
use rand::{Rng, SeedableRng};
use tantivy::collector::{Count, TopDocs};
use tantivy::query::QueryParser;
use tantivy::schema::*;
use tantivy::{doc, Index};

#[global_allocator]
pub static GLOBAL: &PeakMemAlloc<std::alloc::System> = &INSTRUMENTED_SYSTEM;

fn main() {
    bench_range_query();
}

fn bench_range_query() {
    let index = get_index_0_to_100();
    let mut runner = BenchRunner::new();
    runner.add_plugin(PeakMemAllocPlugin::new(GLOBAL));

    runner.set_name("range_query on u64");
    let field_name_and_descr: Vec<_> = vec![
        ("id", "Single Valued Range Field"),
        ("ids", "Multi Valued Range Field"),
    ];
    let range_num_hits = vec![
        ("90_percent", get_90_percent()),
        ("10_percent", get_10_percent()),
        ("1_percent", get_1_percent()),
    ];

    test_range(&mut runner, &index, &field_name_and_descr, range_num_hits);

    runner.set_name("range_query on ip");
    let field_name_and_descr: Vec<_> = vec![
        ("ip", "Single Valued Range Field"),
        ("ips", "Multi Valued Range Field"),
    ];
    let range_num_hits = vec![
        ("90_percent", get_90_percent_ip()),
        ("10_percent", get_10_percent_ip()),
        ("1_percent", get_1_percent_ip()),
    ];

    test_range(&mut runner, &index, &field_name_and_descr, range_num_hits);
}

fn test_range<T: Display>(
    runner: &mut BenchRunner,
    index: &Index,
    field_name_and_descr: &[(&str, &str)],
    range_num_hits: Vec<(&str, RangeInclusive<T>)>,
) {
    for (field, suffix) in field_name_and_descr {
        let term_num_hits = vec![
            ("", ""),
            ("1_percent", "veryfew"),
            ("10_percent", "few"),
            ("90_percent", "most"),
        ];
        let mut group = runner.new_group();
        group.set_name(suffix);
        // all intersect combinations
        for (range_name, range) in &range_num_hits {
            for (term_name, term) in &term_num_hits {
                let index = &index;
                let test_name = if term_name.is_empty() {
                    format!("id_range_hit_{}", range_name)
                } else {
                    format!(
                        "id_range_hit_{}_intersect_with_term_{}",
                        range_name, term_name
                    )
                };
                group.register(test_name, move |_| {
                    let query = if term_name.is_empty() {
                        "".to_string()
                    } else {
                        format!("AND id_name:{}", term)
                    };
                    black_box(execute_query(field, range, &query, index));
                });
            }
        }
        group.run();
    }
}

fn get_index_0_to_100() -> Index {
    let mut rng = StdRng::from_seed([1u8; 32]);
    let num_vals = 100_000;
    let docs: Vec<_> = (0..num_vals)
        .map(|_i| {
            let id_name = if rng.random_bool(0.01) {
                "veryfew".to_string() // 1%
            } else if rng.random_bool(0.1) {
                "few".to_string() // ~9.9% (10% of the remaining 99%)
            } else {
                "most".to_string() // ~89.1%
            };
            Doc {
                id_name,
                id: rng.random_range(0..100),
                // Multiply by 1000, so that we create most buckets in the compact space
                // The benches depend on this range to select n-percent of elements with the
                // methods below.
                ip: Ipv6Addr::from_u128(rng.random_range(0..100) * 1000),
            }
        })
        .collect();

    create_index_from_docs(&docs)
}

#[derive(Clone, Debug)]
pub struct Doc {
    pub id_name: String,
    pub id: u64,
    pub ip: Ipv6Addr,
}

pub fn create_index_from_docs(docs: &[Doc]) -> Index {
    let mut schema_builder = Schema::builder();
    let id_u64_field = schema_builder.add_u64_field("id", INDEXED | STORED | FAST);
    let ids_u64_field =
        schema_builder.add_u64_field("ids", NumericOptions::default().set_fast().set_indexed());

    let id_f64_field = schema_builder.add_f64_field("id_f64", INDEXED | STORED | FAST);
    let ids_f64_field = schema_builder.add_f64_field(
        "ids_f64",
        NumericOptions::default().set_fast().set_indexed(),
    );

    let id_i64_field = schema_builder.add_i64_field("id_i64", INDEXED | STORED | FAST);
    let ids_i64_field = schema_builder.add_i64_field(
        "ids_i64",
        NumericOptions::default().set_fast().set_indexed(),
    );

    let text_field = schema_builder.add_text_field("id_name", STRING | STORED);
    let text_field2 = schema_builder.add_text_field("id_name_fast", STRING | STORED | FAST);

    let ip_field = schema_builder.add_ip_addr_field("ip", FAST);
    let ips_field = schema_builder.add_ip_addr_field("ips", FAST);

    let schema = schema_builder.build();

    let index = Index::create_in_ram(schema);

    {
        let mut index_writer = index.writer_with_num_threads(1, 50_000_000).unwrap();
        for doc in docs.iter() {
            index_writer
                .add_document(doc!(
                    ids_i64_field => doc.id as i64,
                    ids_i64_field => doc.id as i64,
                    ids_f64_field => doc.id as f64,
                    ids_f64_field => doc.id as f64,
                    ids_u64_field => doc.id,
                    ids_u64_field => doc.id,
                    id_u64_field => doc.id,
                    id_f64_field => doc.id as f64,
                    id_i64_field => doc.id as i64,
                    text_field => doc.id_name.to_string(),
                    text_field2 => doc.id_name.to_string(),
                    ips_field => doc.ip,
                    ips_field => doc.ip,
                    ip_field => doc.ip,
                ))
                .unwrap();
        }

        index_writer.commit().unwrap();
    }
    index
}

fn get_90_percent() -> RangeInclusive<u64> {
    0..=90
}

fn get_10_percent() -> RangeInclusive<u64> {
    0..=10
}

fn get_1_percent() -> RangeInclusive<u64> {
    10..=10
}

fn get_90_percent_ip() -> RangeInclusive<Ipv6Addr> {
    let start = Ipv6Addr::from_u128(0);
    let end = Ipv6Addr::from_u128(90 * 1000);
    start..=end
}

fn get_10_percent_ip() -> RangeInclusive<Ipv6Addr> {
    let start = Ipv6Addr::from_u128(0);
    let end = Ipv6Addr::from_u128(10 * 1000);
    start..=end
}

fn get_1_percent_ip() -> RangeInclusive<Ipv6Addr> {
    let start = Ipv6Addr::from_u128(10 * 1000);
    let end = Ipv6Addr::from_u128(10 * 1000);
    start..=end
}

struct NumHits {
    count: usize,
}
impl OutputValue for NumHits {
    fn column_title() -> &'static str {
        "NumHits"
    }
    fn format(&self) -> Option<String> {
        Some(self.count.to_string())
    }
}

fn execute_query<T: Display>(
    field: &str,
    id_range: &RangeInclusive<T>,
    suffix: &str,
    index: &Index,
) -> NumHits {
    let gen_query_inclusive = |from: &T, to: &T| {
        format!(
            "{}:[{} TO {}] {}",
            field,
            &from.to_string(),
            &to.to_string(),
            suffix
        )
    };

    let query = gen_query_inclusive(id_range.start(), id_range.end());
    execute_query_(&query, index)
}

fn execute_query_(query: &str, index: &Index) -> NumHits {
    let query_from_text = |text: &str| {
        QueryParser::for_index(index, vec![])
            .parse_query(text)
            .unwrap()
    };
    let query = query_from_text(query);
    let reader = index.reader().unwrap();
    let searcher = reader.searcher();
    let num_hits = searcher
        .search(&query, &(TopDocs::with_limit(10).order_by_score(), Count))
        .unwrap()
        .1;
    NumHits { count: num_hits }
}
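
The benchmark above drives the query parser with strings of the form `field:[from TO to] suffix`. As a minimal std-only sketch of that string construction (the helper name `range_query_string` is hypothetical and not part of tantivy or of the benchmark file), the formatting step could look like this:

```rust
use std::fmt::Display;
use std::ops::RangeInclusive;

/// Builds an inclusive range query string such as `id:[0 TO 90] AND id_name:few`.
/// `range_query_string` is a hypothetical helper used only for illustration.
fn range_query_string<T: Display>(
    field: &str,
    range: &RangeInclusive<T>,
    suffix: &str,
) -> String {
    // Same "{}:[{} TO {}] {}" layout as `execute_query`;
    // an empty suffix therefore leaves a trailing space.
    format!("{}:[{} TO {}] {}", field, range.start(), range.end(), suffix)
}
```

Calling it with `("ids", &(10u64..=10), "AND id_name:few")` yields the intersect form that the `*_intersect_with_term_*` tasks send to the parser.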
@@ -1,113 +0,0 @@
// Benchmarks regex query that matches all terms in a synthetic index.
//
// Corpus model:
// - N unique terms: t000000, t000001, ...
// - M docs
// - K tokens per doc: doc i gets terms derived from (i, token_index)
//
// Query:
// - Regex "t.*" to match all terms
//
// Run with:
// - cargo bench --bench regex_all_terms
//

use std::fmt::Write;

use binggan::{black_box, BenchRunner};
use tantivy::collector::Count;
use tantivy::query::RegexQuery;
use tantivy::schema::{Schema, TEXT};
use tantivy::{doc, Index, ReloadPolicy};

const HEAP_SIZE_BYTES: usize = 200_000_000;

#[derive(Clone, Copy)]
struct BenchConfig {
    num_terms: usize,
    num_docs: usize,
    tokens_per_doc: usize,
}

fn main() {
    let configs = default_configs();

    let mut runner = BenchRunner::new();
    for config in configs {
        let (index, text_field) = build_index(config, HEAP_SIZE_BYTES);
        let reader = index
            .reader_builder()
            .reload_policy(ReloadPolicy::Manual)
            .try_into()
            .expect("reader");
        let searcher = reader.searcher();
        let query = RegexQuery::from_pattern("t.*", text_field).expect("regex query");

        let mut group = runner.new_group();
        group.set_name(format!(
            "regex_all_terms_t{}_d{}_k{}",
            config.num_terms, config.num_docs, config.tokens_per_doc
        ));
        group.register("regex_count", move |_| {
            let count = searcher.search(&query, &Count).expect("search");
            black_box(count);
        });
        group.run();
    }
}

fn default_configs() -> Vec<BenchConfig> {
    vec![
        BenchConfig {
            num_terms: 10_000,
            num_docs: 100_000,
            tokens_per_doc: 1,
        },
        BenchConfig {
            num_terms: 10_000,
            num_docs: 100_000,
            tokens_per_doc: 8,
        },
        BenchConfig {
            num_terms: 100_000,
            num_docs: 100_000,
            tokens_per_doc: 1,
        },
        BenchConfig {
            num_terms: 100_000,
            num_docs: 100_000,
            tokens_per_doc: 8,
        },
    ]
}

fn build_index(config: BenchConfig, heap_size_bytes: usize) -> (Index, tantivy::schema::Field) {
    let mut schema_builder = Schema::builder();
    let text_field = schema_builder.add_text_field("text", TEXT);
    let schema = schema_builder.build();
    let index = Index::create_in_ram(schema);

    let term_width = config.num_terms.to_string().len();
    {
        let mut writer = index
            .writer_with_num_threads(1, heap_size_bytes)
            .expect("writer");
        let mut buffer = String::new();
        for doc_id in 0..config.num_docs {
            buffer.clear();
            for token_idx in 0..config.tokens_per_doc {
                if token_idx > 0 {
                    buffer.push(' ');
                }
                let term_id = (doc_id * config.tokens_per_doc + token_idx) % config.num_terms;
                write!(&mut buffer, "t{term_id:0term_width$}").expect("write token");
            }
            writer
                .add_document(doc!(text_field => buffer.as_str()))
                .expect("add_document");
        }
        writer.commit().expect("commit");
    }

    (index, text_field)
}
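
The corpus model of the regex benchmark (K zero-padded tokens per document, cycling through N unique terms) can be reproduced without tantivy. The sketch below isolates that token generation; `doc_text` is a hypothetical helper name introduced here for illustration, and the padding width is derived from `num_terms` the same way `build_index` derives `term_width`:

```rust
use std::fmt::Write;

/// Builds the text of one synthetic document: `tokens_per_doc` zero-padded
/// terms, cycling through `num_terms` unique terms.
/// `doc_text` is a hypothetical helper, not a function from the benchmark.
fn doc_text(doc_id: usize, tokens_per_doc: usize, num_terms: usize) -> String {
    // Pad every term id to the number of decimal digits in `num_terms`.
    let term_width = num_terms.to_string().len();
    let mut buffer = String::new();
    for token_idx in 0..tokens_per_doc {
        if token_idx > 0 {
            buffer.push(' ');
        }
        // Consecutive (doc, token) pairs walk the term space and wrap around.
        let term_id = (doc_id * tokens_per_doc + token_idx) % num_terms;
        write!(&mut buffer, "t{term_id:0term_width$}").expect("write token");
    }
    buffer
}
```

Because every generated term starts with `t`, the regex `t.*` matches the whole term dictionary, which is the case the benchmark is designed to stress.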
@@ -1,421 +0,0 @@
// This benchmark compares different approaches for retrieving string values:
//
// 1. Fast Field Approach: retrieves string values via term_ords() and ord_to_str()
//
// 2. Doc Store Approach: retrieves string values via searcher.doc() and field extraction
//
// The benchmark includes various data distributions:
// - Dense Sequential: Sequential document IDs with dense data
// - Dense Random: Random document IDs with dense data
// - Sparse Sequential: Sequential document IDs with sparse data
// - Sparse Random: Random document IDs with sparse data
use std::ops::Bound;

use binggan::{black_box, BenchGroup, BenchRunner};
use rand::prelude::*;
use rand::rngs::StdRng;
use rand::SeedableRng;
use tantivy::collector::{Count, DocSetCollector};
use tantivy::query::RangeQuery;
use tantivy::schema::document::TantivyDocument;
use tantivy::schema::{Schema, Value, FAST, STORED, STRING};
use tantivy::{doc, Index, ReloadPolicy, Searcher, Term};

#[derive(Clone)]
struct BenchIndex {
    #[allow(dead_code)]
    index: Index,
    searcher: Searcher,
}

fn build_shared_indices(num_docs: usize, distribution: &str) -> BenchIndex {
    // Schema with string fast field and stored field for doc access
    let mut schema_builder = Schema::builder();
    let f_str_fast = schema_builder.add_text_field("str_fast", STRING | STORED | FAST);
    let f_str_stored = schema_builder.add_text_field("str_stored", STRING | STORED);
    let schema = schema_builder.build();
    let index = Index::create_in_ram(schema.clone());

    // Populate index with stable RNG for reproducibility.
    let mut rng = StdRng::from_seed([7u8; 32]);

    {
        let mut writer = index.writer_with_num_threads(1, 4_000_000_000).unwrap();

        match distribution {
            "dense_random" => {
                for _doc_id in 0..num_docs {
                    let suffix = rng.random_range(0u64..1000u64);
                    let str_val = format!("str_{:03}", suffix);

                    writer
                        .add_document(doc!(
                            f_str_fast=>str_val.clone(),
                            f_str_stored=>str_val,
                        ))
                        .unwrap();
                }
            }
            "dense_sequential" => {
                for doc_id in 0..num_docs {
                    let suffix = doc_id as u64 % 1000;
                    let str_val = format!("str_{:03}", suffix);

                    writer
                        .add_document(doc!(
                            f_str_fast=>str_val.clone(),
                            f_str_stored=>str_val,
                        ))
                        .unwrap();
                }
            }
            "sparse_random" => {
                for _doc_id in 0..num_docs {
                    let suffix = rng.random_range(0u64..1000000u64);
                    let str_val = format!("str_{:07}", suffix);

                    writer
                        .add_document(doc!(
                            f_str_fast=>str_val.clone(),
                            f_str_stored=>str_val,
                        ))
                        .unwrap();
                }
            }
            "sparse_sequential" => {
                for doc_id in 0..num_docs {
                    let suffix = doc_id as u64;
                    let str_val = format!("str_{:07}", suffix);

                    writer
                        .add_document(doc!(
                            f_str_fast=>str_val.clone(),
                            f_str_stored=>str_val,
                        ))
                        .unwrap();
                }
            }
            _ => {
                panic!("Unsupported distribution type");
            }
        }
        writer.commit().unwrap();
    }

    // Prepare reader/searcher once.
    let reader = index
        .reader_builder()
        .reload_policy(ReloadPolicy::Manual)
        .try_into()
        .unwrap();
    let searcher = reader.searcher();

    BenchIndex { index, searcher }
}

fn main() {
    // Prepare corpora with varying scenarios
    let scenarios = vec![
        (
            "dense_random_search_low_range".to_string(),
            1_000_000,
            "dense_random",
            0,
            9,
        ),
        (
            "dense_random_search_high_range".to_string(),
            1_000_000,
            "dense_random",
            990,
            999,
        ),
        (
            "dense_sequential_search_low_range".to_string(),
            1_000_000,
            "dense_sequential",
            0,
            9,
        ),
        (
            "dense_sequential_search_high_range".to_string(),
            1_000_000,
            "dense_sequential",
            990,
            999,
        ),
        (
            "sparse_random_search_low_range".to_string(),
            1_000_000,
            "sparse_random",
            0,
            9999,
        ),
        (
            "sparse_random_search_high_range".to_string(),
            1_000_000,
            "sparse_random",
            990_000,
            999_999,
        ),
        (
            "sparse_sequential_search_low_range".to_string(),
            1_000_000,
            "sparse_sequential",
            0,
            9999,
        ),
        (
            "sparse_sequential_search_high_range".to_string(),
            1_000_000,
            "sparse_sequential",
            990_000,
            999_999,
        ),
    ];

    let mut runner = BenchRunner::new();
    for (scenario_id, n, distribution, range_low, range_high) in scenarios {
        let bench_index = build_shared_indices(n, distribution);
        let mut group = runner.new_group();
        group.set_name(scenario_id);

        let field = bench_index.searcher.schema().get_field("str_fast").unwrap();

        let (lower_str, upper_str) =
            if distribution == "dense_sequential" || distribution == "dense_random" {
                (
                    format!("str_{:03}", range_low),
                    format!("str_{:03}", range_high),
                )
            } else {
                (
                    format!("str_{:07}", range_low),
                    format!("str_{:07}", range_high),
                )
            };

        let lower_term = Term::from_field_text(field, &lower_str);
        let upper_term = Term::from_field_text(field, &upper_str);

        let query = RangeQuery::new(Bound::Included(lower_term), Bound::Included(upper_term));

        run_benchmark_tasks(&mut group, &bench_index, query, range_low, range_high);

        group.run();
    }
}

/// Run all benchmark tasks for a given range query
fn run_benchmark_tasks(
    bench_group: &mut BenchGroup,
    bench_index: &BenchIndex,
    query: RangeQuery,
    range_low: u64,
    range_high: u64,
) {
    // Test count of matching documents
    add_bench_task_count(
        bench_group,
        bench_index,
        query.clone(),
        range_low,
        range_high,
    );

    // Test fetching all DocIds of matching documents
    add_bench_task_docset(
        bench_group,
        bench_index,
        query.clone(),
        range_low,
        range_high,
    );

    // Test fetching all string fast field values of matching documents
    add_bench_task_fetch_all_strings(
        bench_group,
        bench_index,
        query.clone(),
        range_low,
        range_high,
    );

    // Test fetching all string values of matching documents through doc() method
    add_bench_task_fetch_all_strings_from_doc(
        bench_group,
        bench_index,
        query,
        range_low,
        range_high,
    );
}

fn add_bench_task_count(
    bench_group: &mut BenchGroup,
    bench_index: &BenchIndex,
    query: RangeQuery,
    range_low: u64,
    range_high: u64,
) {
    let task_name = format!("string_search_count_[{}-{}]", range_low, range_high);

    let search_task = CountSearchTask {
        searcher: bench_index.searcher.clone(),
        query,
    };
    bench_group.register(task_name, move |_| black_box(search_task.run()));
}

fn add_bench_task_docset(
    bench_group: &mut BenchGroup,
    bench_index: &BenchIndex,
    query: RangeQuery,
    range_low: u64,
    range_high: u64,
) {
    let task_name = format!("string_fetch_all_docset_[{}-{}]", range_low, range_high);

    let search_task = DocSetSearchTask {
        searcher: bench_index.searcher.clone(),
        query,
    };
    bench_group.register(task_name, move |_| black_box(search_task.run()));
}

fn add_bench_task_fetch_all_strings(
    bench_group: &mut BenchGroup,
    bench_index: &BenchIndex,
    query: RangeQuery,
    range_low: u64,
    range_high: u64,
) {
    let task_name = format!(
        "string_fastfield_fetch_all_strings_[{}-{}]",
        range_low, range_high
    );

    let search_task = FetchAllStringsSearchTask {
        searcher: bench_index.searcher.clone(),
        query,
    };

    bench_group.register(task_name, move |_| {
        let result = black_box(search_task.run());
        result.len()
    });
}

fn add_bench_task_fetch_all_strings_from_doc(
    bench_group: &mut BenchGroup,
    bench_index: &BenchIndex,
    query: RangeQuery,
    range_low: u64,
    range_high: u64,
) {
    let task_name = format!(
        "string_doc_fetch_all_strings_[{}-{}]",
        range_low, range_high
    );

    let search_task = FetchAllStringsFromDocTask {
        searcher: bench_index.searcher.clone(),
        query,
    };

    bench_group.register(task_name, move |_| {
        let result = black_box(search_task.run());
        result.len()
    });
}

struct CountSearchTask {
    searcher: Searcher,
    query: RangeQuery,
}

impl CountSearchTask {
    #[inline(never)]
    pub fn run(&self) -> usize {
        self.searcher.search(&self.query, &Count).unwrap()
    }
}

struct DocSetSearchTask {
    searcher: Searcher,
    query: RangeQuery,
}

impl DocSetSearchTask {
    #[inline(never)]
    pub fn run(&self) -> usize {
        let result = self.searcher.search(&self.query, &DocSetCollector).unwrap();
        result.len()
    }
}

struct FetchAllStringsSearchTask {
    searcher: Searcher,
    query: RangeQuery,
}

impl FetchAllStringsSearchTask {
    #[inline(never)]
    pub fn run(&self) -> Vec<String> {
        let doc_addresses = self.searcher.search(&self.query, &DocSetCollector).unwrap();
        let mut docs = doc_addresses.into_iter().collect::<Vec<_>>();
        docs.sort();
        let mut strings = Vec::with_capacity(docs.len());

        for doc_address in docs {
            let segment_reader = &self.searcher.segment_readers()[doc_address.segment_ord as usize];
            let str_column_opt = segment_reader.fast_fields().str("str_fast");

            if let Ok(Some(str_column)) = str_column_opt {
                let doc_id = doc_address.doc_id;
                let term_ord = str_column.term_ords(doc_id).next().unwrap();
                let mut str_buffer = String::new();
                if str_column.ord_to_str(term_ord, &mut str_buffer).is_ok() {
                    strings.push(str_buffer);
                }
            }
        }

        strings
    }
}

struct FetchAllStringsFromDocTask {
    searcher: Searcher,
    query: RangeQuery,
}

impl FetchAllStringsFromDocTask {
    #[inline(never)]
    pub fn run(&self) -> Vec<String> {
        let doc_addresses = self.searcher.search(&self.query, &DocSetCollector).unwrap();
        let mut docs = doc_addresses.into_iter().collect::<Vec<_>>();
        docs.sort();
        let mut strings = Vec::with_capacity(docs.len());

        let str_stored_field = self
            .searcher
            .schema()
            .get_field("str_stored")
            .expect("str_stored field should exist");

        for doc_address in docs {
            // Get the document from the doc store (row store access)
            if let Ok(doc) = self.searcher.doc::<TantivyDocument>(doc_address) {
                // Extract string values from the stored field
                if let Some(field_value) = doc.get_first(str_stored_field) {
                    if let Some(text) = field_value.as_value().as_str() {
                        strings.push(text.to_string());
                    }
                }
            }
        }

        strings
    }
}
}
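
The string range scenarios above build their bounds from zero-padded keys (`str_{:03}` and `str_{:07}`). Assuming term ranges on string fields compare keys lexicographically, the padding is what keeps that order aligned with numeric order. A minimal std-only sketch follows; `padded_key` and `in_range` are hypothetical helper names and no tantivy calls are involved:

```rust
/// Formats a key the way the dense scenarios do: `str_` plus a 3-digit,
/// zero-padded number. Hypothetical helper for illustration only.
fn padded_key(n: u64) -> String {
    format!("str_{:03}", n)
}

/// Returns true if `key` lies in the inclusive lexicographic range [low, high].
fn in_range(key: &str, low: &str, high: &str) -> bool {
    low <= key && key <= high
}
```

With padding, `str_010` falls outside `[str_000, str_009]` as intended, whereas the unpadded key `str_10` would sort between `str_0` and `str_9` and be matched by mistake.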
@@ -1,6 +1,6 @@
[package]
name = "tantivy-bitpacker"
version = "0.10.0"
version = "0.9.0"
edition = "2024"
authors = ["Paul Masurel <paul.masurel@gmail.com>"]
license = "MIT"
@@ -18,10 +18,5 @@ homepage = "https://github.com/quickwit-oss/tantivy"
bitpacking = { version = "0.9.2", default-features = false, features = ["bitpacker1x"] }

[dev-dependencies]
binggan = "0.17.0"
rand = "0.9"
rand = "0.8"
proptest = "1"

[[bench]]
name = "bench"
harness = false

@@ -1,110 +1,65 @@
use std::cell::RefCell;
#![feature(test)]

use binggan::{BenchRunner, black_box};
use rand::rng;
use rand::seq::IteratorRandom;
use tantivy_bitpacker::{BitPacker, BitUnpacker, BlockedBitpacker};
extern crate test;

fn create_bitpacked_data(bit_width: u8, num_els: u32) -> Vec<u8> {
    let mut bitpacker = BitPacker::new();
    let mut buffer = Vec::new();
    for _ in 0..num_els {
        bitpacker.write(0u64, bit_width, &mut buffer).unwrap();
        bitpacker.flush(&mut buffer).unwrap();
    }
    buffer
}
#[cfg(test)]
mod tests {
    use rand::seq::IteratorRandom;
    use rand::thread_rng;
    use tantivy_bitpacker::{BitPacker, BitUnpacker, BlockedBitpacker};
    use test::Bencher;

const N: usize = 100_000;
const MAX_VAL: u64 = 1_000;
const BIT_WIDTH: u8 = 10; // 2^10 = 1024 > MAX_VAL

fn create_packed_data() -> (BitUnpacker, Vec<u8>) {
    let mut bitpacker = BitPacker::new();
    let mut data = Vec::new();
    for i in 0..N as u64 {
        let val = i * MAX_VAL / N as u64;
        bitpacker.write(val, BIT_WIDTH, &mut data).unwrap();
    }
    bitpacker.close(&mut data).unwrap();
    (BitUnpacker::new(BIT_WIDTH), data)
}

fn bench_bitpacking() {
    let mut runner = BenchRunner::new();
    let bit_width = 3;
    let num_els = 1_000_000u32;
    let bit_unpacker = BitUnpacker::new(bit_width);
    let data = create_bitpacked_data(bit_width, num_els);
    let idxs: Vec<u32> = (0..num_els).choose_multiple(&mut rng(), 100_000);
    runner.bench_function("bitpacking_read", move |_| {
        let mut out = 0u64;
        for &idx in &idxs {
            out = out.wrapping_add(bit_unpacker.get(idx, &data[..]));
    #[inline(never)]
    fn create_bitpacked_data(bit_width: u8, num_els: u32) -> Vec<u8> {
        let mut bitpacker = BitPacker::new();
        let mut buffer = Vec::new();
        for _ in 0..num_els {
            // the values do not matter.
            bitpacker.write(0u64, bit_width, &mut buffer).unwrap();
            bitpacker.flush(&mut buffer).unwrap();
        }
        black_box(out);
    });
}

fn bench_blocked_bitpacker() {
    let mut runner = BenchRunner::new();
    let mut blocked_bitpacker = BlockedBitpacker::new();
    for val in 0..=21500 {
        blocked_bitpacker.add(val * val);
        buffer
    }
    runner.bench_function("blockedbitp_read", move |_| {
        let mut out = 0u64;
        for val in 0..=21500 {
            out = out.wrapping_add(blocked_bitpacker.get(val));
        }
        black_box(out);
    });
    runner.bench_function("blockedbitp_create", |_| {

    #[bench]
    fn bench_bitpacking_read(b: &mut Bencher) {
        let bit_width = 3;
        let num_els = 1_000_000u32;
        let bit_unpacker = BitUnpacker::new(bit_width);
        let data = create_bitpacked_data(bit_width, num_els);
        let idxs: Vec<u32> = (0..num_els).choose_multiple(&mut thread_rng(), 100_000);
        b.iter(|| {
            let mut out = 0u64;
            for &idx in &idxs {
                out = out.wrapping_add(bit_unpacker.get(idx, &data[..]));
            }
            out
        });
    }

    #[bench]
    fn bench_blockedbitp_read(b: &mut Bencher) {
        let mut blocked_bitpacker = BlockedBitpacker::new();
        for val in 0..=21500 {
            blocked_bitpacker.add(val * val);
        }
        black_box(blocked_bitpacker);
    });
}

fn bench_filter_vec() {
    let mut runner = BenchRunner::new();

    let (unpacker, data) = create_packed_data();
    let positions = RefCell::new(Vec::with_capacity(N));
    runner.bench_function("filter_vec_dense", move |_| {
        unpacker.get_ids_for_value_range(
            250..=750,
            0..N as u32,
            &data,
            &mut positions.borrow_mut(),
        );
        black_box(positions.borrow().len());
    });

    let (unpacker, data) = create_packed_data();
    let positions = RefCell::new(Vec::with_capacity(N));
    runner.bench_function("filter_vec_sparse", move |_| {
        unpacker.get_ids_for_value_range(0..=50, 0..N as u32, &data, &mut positions.borrow_mut());
        black_box(positions.borrow().len());
    });

    let (unpacker, data) = create_packed_data();
    let positions = RefCell::new(Vec::with_capacity(N));
    runner.bench_function("filter_vec_full", move |_| {
        unpacker.get_ids_for_value_range(
            0..=MAX_VAL,
            0..N as u32,
            &data,
            &mut positions.borrow_mut(),
        );
        black_box(positions.borrow().len());
    });
}

fn main() {
    bench_bitpacking();
    bench_blocked_bitpacker();
    bench_filter_vec();
        b.iter(|| {
            let mut out = 0u64;
            for val in 0..=21500 {
                out = out.wrapping_add(blocked_bitpacker.get(val));
            }
            out
        });
    }
||||
|
||||
#[bench]
|
||||
fn bench_blockedbitp_create(b: &mut Bencher) {
|
||||
b.iter(|| {
|
||||
let mut blocked_bitpacker = BlockedBitpacker::new();
|
||||
for val in 0..=21500 {
|
||||
blocked_bitpacker.add(val * val);
|
||||
}
|
||||
blocked_bitpacker
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
@@ -48,7 +48,7 @@ impl BitPacker {
|
||||
|
||||
pub fn flush<TWrite: io::Write + ?Sized>(&mut self, output: &mut TWrite) -> io::Result<()> {
|
||||
if self.mini_buffer_written > 0 {
|
||||
let num_bytes = self.mini_buffer_written.div_ceil(8);
|
||||
let num_bytes = (self.mini_buffer_written + 7) / 8;
|
||||
let bytes = self.mini_buffer.to_le_bytes();
|
||||
output.write_all(&bytes[..num_bytes])?;
|
||||
self.mini_buffer_written = 0;
|
||||
@@ -138,7 +138,7 @@ impl BitUnpacker {
|
||||
|
||||
// We use `usize` here to avoid overflow issues.
|
||||
let end_bit_read = (end_idx as usize) * self.num_bits;
|
||||
let end_byte_read = end_bit_read.div_ceil(8);
|
||||
let end_byte_read = (end_bit_read + 7) / 8;
|
||||
assert!(
|
||||
end_byte_read <= data.len(),
|
||||
"Requested index is out of bounds."
|
||||
@@ -258,7 +258,7 @@ mod test {
|
||||
bitpacker.write(val, num_bits, &mut data).unwrap();
|
||||
}
|
||||
bitpacker.close(&mut data).unwrap();
|
||||
assert_eq!(data.len(), ((num_bits as usize) * len).div_ceil(8));
|
||||
assert_eq!(data.len(), ((num_bits as usize) * len + 7) / 8);
|
||||
let bitunpacker = BitUnpacker::new(num_bits);
|
||||
(bitunpacker, vals, data)
|
||||
}
|
||||
@@ -304,7 +304,7 @@ mod test {
|
||||
bitpacker.write(val, num_bits, &mut buffer).unwrap();
|
||||
}
|
||||
bitpacker.flush(&mut buffer).unwrap();
|
||||
assert_eq!(buffer.len(), (vals.len() * num_bits as usize).div_ceil(8));
|
||||
assert_eq!(buffer.len(), (vals.len() * num_bits as usize + 7) / 8);
|
||||
let bitunpacker = BitUnpacker::new(num_bits);
|
||||
let max_val = if num_bits == 64 {
|
||||
u64::MAX
|
||||
|
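The bitpacker hunks above swap between `x.div_ceil(8)` and `(x + 7) / 8` when converting a bit count into a byte count. A minimal sketch of why the two spellings agree for ordinary inputs is below; the function names are hypothetical and are not part of the crate. It assumes a toolchain where `usize::div_ceil` is stable (Rust 1.73 or later).

```rust
/// Bytes needed to hold `num_bits` bits, written with manual rounding.
/// Hypothetical helper for illustration only.
fn bytes_for_bits_manual(num_bits: usize) -> usize {
    // Adding 7 before the division rounds any partial byte up.
    // Note: this can overflow if `num_bits` is within 7 of `usize::MAX`.
    (num_bits + 7) / 8
}

/// Same quantity using the standard library's ceiling division.
/// Hypothetical helper for illustration only.
fn bytes_for_bits_div_ceil(num_bits: usize) -> usize {
    // `div_ceil` rounds towards positive infinity and has no overflow
    // hazard for a non-zero divisor.
    num_bits.div_ceil(8)
}

/// Returns true if both spellings agree for every bit count up to `limit`.
fn spellings_agree_up_to(limit: usize) -> bool {
    (0..=limit).all(|n| bytes_for_bits_manual(n) == bytes_for_bits_div_ceil(n))
}
```

For example, five 3-bit values occupy 15 bits, which both helpers round up to 2 bytes. The only behavioral difference is at the very top of the `usize` range, which the bit counts in these hunks are unlikely to reach.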
||||
@@ -140,10 +140,10 @@ impl BlockedBitpacker {
|
||||
pub fn iter(&self) -> impl Iterator<Item = u64> + '_ {
|
||||
// todo performance: we could decompress a whole block and cache it instead
|
||||
let bitpacked_elems = self.offset_and_bits.len() * BLOCK_SIZE;
|
||||
|
||||
(0..bitpacked_elems)
|
||||
let iter = (0..bitpacked_elems)
|
||||
.map(move |idx| self.get(idx))
|
||||
.chain(self.buffer.iter().cloned())
|
||||
.chain(self.buffer.iter().cloned());
|
||||
iter
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
@@ -19,7 +19,7 @@ fn u32_to_i32(val: u32) -> i32 {
|
||||
#[inline]
|
||||
unsafe fn u32_to_i32_avx2(vals_u32x8s: DataType) -> DataType {
|
||||
const HIGHEST_BIT_MASK: DataType = from_u32x8([HIGHEST_BIT; NUM_LANES]);
|
||||
unsafe { op_xor(vals_u32x8s, HIGHEST_BIT_MASK) }
|
||||
op_xor(vals_u32x8s, HIGHEST_BIT_MASK)
|
||||
}
|
||||
|
||||
pub fn filter_vec_in_place(range: RangeInclusive<u32>, offset: u32, output: &mut Vec<u32>) {
|
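The `u32_to_i32_avx2` helper above XORs every lane with `HIGHEST_BIT`. The reason is that AVX2 only offers a signed 32-bit greater-than comparison, so unsigned values are first remapped so that their unsigned order becomes signed order. The scalar sketch below illustrates the mapping; the body of the scalar `u32_to_i32` is not visible in this hunk, so the version here is an assumption, as is the value of `HIGHEST_BIT`.

```rust
// Assumed value: the sign bit of a 32-bit integer.
const HIGHEST_BIT: u32 = 1 << 31;

/// Order-preserving map from u32 to i32 (assumed scalar equivalent of the
/// AVX2 helper): flipping the top bit sends 0 to i32::MIN and u32::MAX to
/// i32::MAX, keeping every pair in the same relative order.
fn u32_to_i32(val: u32) -> i32 {
    (val ^ HIGHEST_BIT) as i32
}

/// Unsigned range check expressed with signed comparisons only, which is
/// the shape a signed-compare-only SIMD instruction set requires.
fn in_range_via_signed(val: u32, start: u32, end: u32) -> bool {
    let v = u32_to_i32(val);
    let too_low = u32_to_i32(start) > v;
    let too_high = v > u32_to_i32(end);
    !(too_low || too_high)
}
```

The `too_low` / `too_high` naming mirrors `compute_filter_bitset` further down in this diff, where the two masks are OR-ed and then inverted.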
||||
@@ -66,19 +66,17 @@ unsafe fn filter_vec_avx2_aux(
|
||||
]);
|
||||
const SHIFT: __m256i = from_u32x8([NUM_LANES as u32; NUM_LANES]);
|
||||
for _ in 0..num_words {
|
||||
unsafe {
|
||||
let word = load_unaligned(input);
|
||||
let word = u32_to_i32_avx2(word);
|
||||
let keeper_bitset = compute_filter_bitset(word, range_simd.clone());
|
||||
let added_len = keeper_bitset.count_ones();
|
||||
let filtered_doc_ids = compact(ids, keeper_bitset);
|
||||
store_unaligned(output_tail as *mut __m256i, filtered_doc_ids);
|
||||
output_tail = output_tail.offset(added_len as isize);
|
||||
ids = op_add(ids, SHIFT);
|
||||
input = input.offset(1);
|
||||
}
|
||||
let word = load_unaligned(input);
|
||||
let word = u32_to_i32_avx2(word);
|
||||
let keeper_bitset = compute_filter_bitset(word, range_simd.clone());
|
||||
let added_len = keeper_bitset.count_ones();
|
||||
let filtered_doc_ids = compact(ids, keeper_bitset);
|
||||
store_unaligned(output_tail as *mut __m256i, filtered_doc_ids);
|
||||
output_tail = output_tail.offset(added_len as isize);
|
||||
ids = op_add(ids, SHIFT);
|
||||
input = input.offset(1);
|
||||
}
|
||||
unsafe { output_tail.offset_from(output) as usize }
|
||||
output_tail.offset_from(output) as usize
|
||||
}
|
||||
|
||||
#[inline]
|
||||
@@ -94,7 +92,8 @@ unsafe fn compute_filter_bitset(val: __m256i, range: std::ops::RangeInclusive<__
|
||||
let too_low = op_greater(*range.start(), val);
|
||||
let too_high = op_greater(val, *range.end());
|
||||
let inside = op_or(too_low, too_high);
|
||||
255 - std::arch::x86_64::_mm256_movemask_ps(_mm256_castsi256_ps(inside)) as u8
|
||||
255 - std::arch::x86_64::_mm256_movemask_ps(std::mem::transmute::<DataType, __m256>(inside))
|
||||
as u8
|
||||
}
|
||||
|
||||
union U8x32 {
|
||||
|
||||
@@ -1,17 +1,8 @@
|
||||
#[cfg(all(target_arch = "aarch64", not(target_vendor = "apple")))]
|
||||
use std::arch::is_aarch64_feature_detected;
|
||||
use std::ops::RangeInclusive;
|
||||
|
||||
#[cfg(target_arch = "x86_64")]
|
||||
mod avx2;
|
||||
|
||||
#[cfg(target_arch = "aarch64")]
|
||||
mod neon;
|
||||
|
||||
// SVE intrinsics are not exposed on aarch64-apple-darwin.
|
||||
#[cfg(all(target_arch = "aarch64", not(target_vendor = "apple")))]
|
||||
mod sve;
|
||||
|
||||
mod scalar;
|
||||
|
||||
#[derive(Clone, Copy, Eq, PartialEq, Debug)]
|
||||
@@ -19,10 +10,6 @@ mod scalar;
|
||||
enum FilterImplPerInstructionSet {
|
||||
#[cfg(target_arch = "x86_64")]
|
||||
AVX2 = 0u8,
|
||||
#[cfg(all(target_arch = "aarch64", not(target_vendor = "apple")))]
|
||||
SVE = 3u8,
|
||||
#[cfg(target_arch = "aarch64")]
|
||||
Neon = 2u8,
|
||||
Scalar = 1u8,
|
||||
}
|
||||
|
||||
@@ -32,57 +19,29 @@ impl FilterImplPerInstructionSet {
|
||||
match *self {
|
||||
#[cfg(target_arch = "x86_64")]
|
||||
FilterImplPerInstructionSet::AVX2 => is_x86_feature_detected!("avx2"),
|
||||
#[cfg(all(target_arch = "aarch64", not(target_vendor = "apple")))]
|
||||
FilterImplPerInstructionSet::SVE => is_aarch64_feature_detected!("sve"),
|
||||
// TIL Neon is required on aarch 64.
|
||||
#[cfg(target_arch = "aarch64")]
|
||||
FilterImplPerInstructionSet::Neon => true,
|
||||
FilterImplPerInstructionSet::Scalar => true,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// List of available implementations in preferred order.
|
||||
// List of available implementation in preferred order.
|
||||
#[cfg(target_arch = "x86_64")]
|
||||
const IMPLS: [FilterImplPerInstructionSet; 2] = [
|
||||
FilterImplPerInstructionSet::AVX2,
|
||||
FilterImplPerInstructionSet::Scalar,
|
||||
];
|
||||
|
||||
// Non-Apple aarch64: try SVE, NEON, Scalar.
|
||||
#[cfg(all(target_arch = "aarch64", not(target_vendor = "apple")))]
|
||||
const IMPLS: [FilterImplPerInstructionSet; 3] = [
|
||||
FilterImplPerInstructionSet::SVE,
|
||||
FilterImplPerInstructionSet::Neon,
|
||||
FilterImplPerInstructionSet::Scalar,
|
||||
];
|
||||
|
||||
// Apple aarch64 (M-series): SVE not available; use NEON or Scalar.
|
||||
#[cfg(all(target_arch = "aarch64", target_vendor = "apple"))]
|
||||
const IMPLS: [FilterImplPerInstructionSet; 2] = [
|
||||
FilterImplPerInstructionSet::Neon,
|
||||
FilterImplPerInstructionSet::Scalar,
|
||||
];
|
||||
|
||||
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
|
||||
#[cfg(not(target_arch = "x86_64"))]
|
||||
const IMPLS: [FilterImplPerInstructionSet; 1] = [FilterImplPerInstructionSet::Scalar];
|
||||
|
||||
impl FilterImplPerInstructionSet {
|
||||
#[inline]
|
||||
#[allow(unused_variables)]
|
||||
#[allow(unused_variables)] // on non-x86_64, code is unused.
|
||||
fn from(code: u8) -> FilterImplPerInstructionSet {
|
||||
#[cfg(target_arch = "x86_64")]
|
||||
if code == FilterImplPerInstructionSet::AVX2 as u8 {
|
||||
return FilterImplPerInstructionSet::AVX2;
|
||||
}
|
||||
#[cfg(all(target_arch = "aarch64", not(target_vendor = "apple")))]
|
||||
if code == FilterImplPerInstructionSet::SVE as u8 {
|
||||
return FilterImplPerInstructionSet::SVE;
|
||||
}
|
||||
#[cfg(target_arch = "aarch64")]
|
||||
if code == FilterImplPerInstructionSet::Neon as u8 {
|
||||
return FilterImplPerInstructionSet::Neon;
|
||||
}
|
||||
FilterImplPerInstructionSet::Scalar
|
||||
}
|
||||
|
||||
@@ -91,13 +50,6 @@ impl FilterImplPerInstructionSet {
|
||||
match self {
|
||||
#[cfg(target_arch = "x86_64")]
|
||||
FilterImplPerInstructionSet::AVX2 => avx2::filter_vec_in_place(range, offset, output),
|
||||
#[cfg(all(target_arch = "aarch64", not(target_vendor = "apple")))]
|
||||
// SAFETY: SVE availability was verified by is_available() before selecting this impl.
|
||||
FilterImplPerInstructionSet::SVE => unsafe {
|
||||
sve::filter_vec_in_place(range, offset, output)
|
||||
},
|
||||
#[cfg(target_arch = "aarch64")]
|
||||
FilterImplPerInstructionSet::Neon => neon::filter_vec_in_place(range, offset, output),
|
||||
FilterImplPerInstructionSet::Scalar => {
|
||||
scalar::filter_vec_in_place(range, offset, output)
|
||||
}
|
||||
@@ -105,12 +57,6 @@ impl FilterImplPerInstructionSet {
|
||||
}
|
||||
}
|
||||
|
||||
fn available_impls() -> impl Iterator<Item = FilterImplPerInstructionSet> {
|
||||
IMPLS
|
||||
.into_iter()
|
||||
.filter(FilterImplPerInstructionSet::is_available)
|
||||
}
|
||||
|
||||
#[inline]
|
||||
fn get_best_available_instruction_set() -> FilterImplPerInstructionSet {
|
||||
use std::sync::atomic::{AtomicU8, Ordering};
|
||||
@@ -118,7 +64,10 @@ fn get_best_available_instruction_set() -> FilterImplPerInstructionSet {
|
||||
let instruction_set_byte: u8 = INSTRUCTION_SET_BYTE.load(Ordering::Relaxed);
|
||||
if instruction_set_byte == u8::MAX {
|
||||
// Let's initialize the instruction set and cache it.
|
||||
let instruction_set = available_impls().next().unwrap();
|
||||
let instruction_set = IMPLS
|
||||
.into_iter()
|
||||
.find(FilterImplPerInstructionSet::is_available)
|
||||
.unwrap();
|
||||
INSTRUCTION_SET_BYTE.store(instruction_set as u8, Ordering::Relaxed);
|
||||
return instruction_set;
|
||||
}
|
||||
@@ -131,12 +80,12 @@ pub fn filter_vec_in_place(range: RangeInclusive<u32>, offset: u32, output: &mut
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use proptest::strategy::Strategy;
|
||||
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn test_get_best_available_instruction_set() {
|
||||
// This does not test much unfortunately.
|
||||
// We just make sure the function returns without crashing and returns the same result.
|
||||
let instruction_set = get_best_available_instruction_set();
|
||||
assert_eq!(get_best_available_instruction_set(), instruction_set);
|
||||
}
|
||||
@@ -153,31 +102,6 @@ mod tests {
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(all(target_arch = "aarch64", not(target_vendor = "apple")))]
|
||||
#[test]
|
||||
fn test_instruction_set_to_code_from_code() {
|
||||
for instruction_set in [
|
||||
FilterImplPerInstructionSet::SVE,
|
||||
FilterImplPerInstructionSet::Neon,
|
||||
FilterImplPerInstructionSet::Scalar,
|
||||
] {
|
||||
let code = instruction_set as u8;
|
||||
assert_eq!(instruction_set, FilterImplPerInstructionSet::from(code));
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(all(target_arch = "aarch64", target_vendor = "apple"))]
|
||||
#[test]
|
||||
fn test_instruction_set_to_code_from_code() {
|
||||
for instruction_set in [
|
||||
FilterImplPerInstructionSet::Neon,
|
||||
FilterImplPerInstructionSet::Scalar,
|
||||
] {
|
||||
let code = instruction_set as u8;
|
||||
assert_eq!(instruction_set, FilterImplPerInstructionSet::from(code));
|
||||
}
|
||||
}
|
||||
|
||||
fn test_filter_impl_empty_aux(filter_impl: FilterImplPerInstructionSet) {
|
||||
let mut output = vec![];
|
||||
filter_impl.filter_vec_in_place(0..=u32::MAX, 0, &mut output);
|
||||
@@ -202,20 +126,11 @@ mod tests {
|
||||
assert_eq!(&output, &[1, 3, 4, 5, 6, 7, 8]);
|
||||
}
|
||||
|
||||
fn test_filter_impl_empty_range_aux(filter_impl: FilterImplPerInstructionSet) {
|
||||
// start > end: RangeInclusive::contains always returns false; output must be empty.
|
||||
// The SVE path's wrapping_sub would otherwise produce a huge range_width.
|
||||
let mut output = vec![3, 2, 1, 5, 11, 2, 5, 10, 2];
|
||||
filter_impl.filter_vec_in_place(10..=5, 0, &mut output);
|
||||
assert_eq!(&output, &[]);
|
||||
}
|
||||
|
||||
fn test_filter_impl_test_suite(filter_impl: FilterImplPerInstructionSet) {
|
||||
test_filter_impl_empty_aux(filter_impl);
|
||||
test_filter_impl_simple_aux(filter_impl);
|
||||
test_filter_impl_simple_aux_shifted(filter_impl);
|
||||
test_filter_impl_simple_outside_i32_range(filter_impl);
|
||||
test_filter_impl_empty_range_aux(filter_impl);
|
||||
}
|
||||
|
||||
#[test]
|
||||
@@ -226,60 +141,25 @@ mod tests {
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
#[cfg(all(target_arch = "aarch64", not(target_vendor = "apple")))]
|
||||
fn test_filter_implementation_sve() {
|
||||
if FilterImplPerInstructionSet::SVE.is_available() {
|
||||
test_filter_impl_test_suite(FilterImplPerInstructionSet::SVE);
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
#[cfg(target_arch = "aarch64")]
|
||||
fn test_filter_implementation_neon() {
|
||||
test_filter_impl_test_suite(FilterImplPerInstructionSet::Neon);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_filter_implementation_scalar() {
|
||||
test_filter_impl_test_suite(FilterImplPerInstructionSet::Scalar);
|
||||
}
|
||||
|
||||
fn max_val_strategy() -> impl proptest::strategy::Strategy<Value = u32> {
|
||||
proptest::prop_oneof![
|
||||
0u32..10u32,
|
||||
255u32..258u32,
|
||||
proptest::prelude::Just(1u32 << 25),
|
||||
proptest::prelude::Just(u32::MAX - 1),
|
||||
proptest::prelude::Just(u32::MAX),
|
||||
]
|
||||
}
|
||||
|
||||
fn vals_strategy() -> impl proptest::strategy::Strategy<Value = Vec<u32>> {
|
||||
proptest::prop_oneof![
|
||||
proptest::collection::vec(proptest::prelude::any::<u32>(), 0..300),
|
||||
max_val_strategy()
|
||||
.prop_flat_map(|max_val| { proptest::collection::vec(0..=max_val, 0..300) })
|
||||
]
|
||||
}
|
||||
|
||||
#[cfg(target_arch = "x86_64")]
|
||||
proptest::proptest! {
|
||||
#[test]
|
||||
fn test_filter_compare_scalar_and_impls_impl_proptest(
|
||||
start in 0u32..400u32,
|
||||
end in 0u32..400u32,
|
||||
fn test_filter_compare_scalar_and_avx2_impl_proptest(
|
||||
start in proptest::prelude::any::<u32>(),
|
||||
end in proptest::prelude::any::<u32>(),
|
||||
offset in 0u32..2u32,
|
||||
vals in vals_strategy()) {
|
||||
for implementation in available_impls() {
|
||||
if implementation == FilterImplPerInstructionSet::Scalar {
|
||||
continue;
|
||||
}
|
||||
let mut impl_output = vals.clone();
|
||||
let mut scalar_output = vals.clone();
|
||||
implementation.filter_vec_in_place(start..=end, offset, &mut impl_output);
|
||||
FilterImplPerInstructionSet::Scalar.filter_vec_in_place(start..=end, offset, &mut scalar_output);
|
||||
assert_eq!(&impl_output, &scalar_output);
|
||||
}
|
||||
mut vals in proptest::collection::vec(0..u32::MAX, 0..30)) {
|
||||
if FilterImplPerInstructionSet::AVX2.is_available() {
|
||||
let mut vals_clone = vals.clone();
|
||||
FilterImplPerInstructionSet::AVX2.filter_vec_in_place(start..=end, offset, &mut vals);
|
||||
FilterImplPerInstructionSet::Scalar.filter_vec_in_place(start..=end, offset, &mut vals_clone);
|
||||
assert_eq!(&vals, &vals_clone);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@@ -1,118 +0,0 @@
|
||||
use std::arch::aarch64::*;
|
||||
use std::ops::RangeInclusive;
|
||||
|
||||
const NUM_LANES: usize = 4;
|
||||
|
||||
// Compacts matching lanes to the front using a byte-level shuffle.
|
||||
// `mask` is a 4-bit value: bit k=1 means lane k should appear in the output.
|
||||
#[inline]
|
||||
#[target_feature(enable = "neon")]
|
||||
unsafe fn compact(data: uint32x4_t, mask: u8) -> uint32x4_t {
|
||||
unsafe {
|
||||
// SAFETY: mask is always in [0, 15] by construction (max sum of [1,2,4,8]).
|
||||
// BYTE_SHUFFLE_TABLE has 16 entries, so this is always in bounds.
|
||||
let shuffle = BYTE_SHUFFLE_TABLE.get_unchecked(mask as usize);
|
||||
let shuffle_vec = vld1q_u8(shuffle.as_ptr());
|
||||
vreinterpretq_u32_u8(vqtbl1q_u8(vreinterpretq_u8_u32(data), shuffle_vec))
|
||||
}
|
||||
}
|
||||
|
||||
// Safe (not unsafe) because NEON is mandatory on aarch64: no runtime feature check needed.
|
||||
#[inline(never)]
|
||||
pub fn filter_vec_in_place(range: RangeInclusive<u32>, offset: u32, output: &mut Vec<u32>) {
|
||||
let num_words = output.len() / NUM_LANES;
|
||||
let mut output_len = unsafe {
|
||||
filter_vec_neon_aux(
|
||||
output.as_ptr(),
|
||||
range.clone(),
|
||||
output.as_mut_ptr(),
|
||||
offset,
|
||||
num_words,
|
||||
)
|
||||
};
|
||||
let remainder_start = num_words * NUM_LANES;
|
||||
for i in remainder_start..output.len() {
|
||||
let val = output[i];
|
||||
output[output_len] = offset + i as u32;
|
||||
output_len += if range.contains(&val) { 1 } else { 0 };
|
||||
}
|
||||
output.truncate(output_len);
|
||||
}
|
||||
|
||||
#[target_feature(enable = "neon")]
|
||||
unsafe fn filter_vec_neon_aux(
|
||||
input: *const u32,
|
||||
range: RangeInclusive<u32>,
|
||||
output: *mut u32,
|
||||
offset: u32,
|
||||
num_words: usize,
|
||||
) -> usize {
|
||||
unsafe {
|
||||
let mut input = input;
|
||||
let mut output_tail = output;
|
||||
let range_start_simd = vdupq_n_u32(*range.start());
|
||||
let range_end_simd = vdupq_n_u32(*range.end());
|
||||
let mut ids = vld1q_u32([offset, offset + 1, offset + 2, offset + 3].as_ptr());
|
||||
let shift = vdupq_n_u32(NUM_LANES as u32);
|
||||
let bit_weights = vld1q_u32([1u32, 2, 4, 8].as_ptr());
|
||||
|
||||
for _ in 0..num_words {
|
||||
let word = vld1q_u32(input);
|
||||
|
||||
// Unsigned compares: CMHS (compare higher or same) tests `word >= start`
|
||||
// and `end >= word`. ANDing both gives the inside-range mask directly,
|
||||
// which is cheaper than computing `outside` and then negating.
|
||||
let ge_start = vcgeq_u32(word, range_start_simd);
|
||||
let le_end = vcleq_u32(word, range_end_simd);
|
||||
// inside[k] = 0xFFFFFFFF if val[k] is in range, 0 otherwise.
|
||||
let inside = vandq_u32(ge_start, le_end);
|
||||
|
||||
// Build the 4-bit mask: AND bit_weights with the inside lane mask, so each
|
||||
// inside lane contributes its bit_weight (1, 2, 4, or 8). Summing yields the
|
||||
// 4-bit mask in one addv.
|
||||
let inside_bits = vandq_u32(bit_weights, inside);
|
||||
let mask = vaddvq_u32(inside_bits) as u8;
|
||||
// mask is mathematically bounded: max value is 1+2+4+8=15 (all lanes match)
|
||||
debug_assert!(mask <= 15, "mask must fit in 4 bits: {}", mask);
|
||||
|
||||
// Count of matching lanes = popcount(mask). Derives the count directly from
|
||||
// the mask instead of running a parallel SIMD reduction over `outside`.
|
||||
let added_len = mask.count_ones() as usize;
|
||||
|
||||
// Safe because mask is guaranteed to be in [0, 15]
|
||||
let filtered_ids = compact(ids, mask);
|
||||
vst1q_u32(output_tail, filtered_ids);
|
||||
output_tail = output_tail.add(added_len);
|
||||
ids = vaddq_u32(ids, shift);
|
||||
input = input.add(NUM_LANES);
|
||||
}
|
||||
|
||||
output_tail.offset_from(output) as usize
|
||||
}
|
||||
}
|
||||
|
||||
// Byte shuffle patterns to compact matching lanes to the front of the vector.
|
||||
// Index is a 4-bit mask: bit k=1 means lane k (bytes 4k..4k+3) is in-range.
|
||||
// The j-th set bit determines which input lane goes to output position j.
|
||||
const BYTE_SHUFFLE_TABLE: [[u8; 16]; 16] = [
|
||||
[
|
||||
16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
|
||||
], // 0b0000: none
|
||||
[0, 1, 2, 3, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16], // 0b0001: lane 0
|
||||
[4, 5, 6, 7, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16], // 0b0010: lane 1
|
||||
[0, 1, 2, 3, 4, 5, 6, 7, 16, 16, 16, 16, 16, 16, 16, 16], // 0b0011: lanes 0,1
|
||||
[8, 9, 10, 11, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16], // 0b0100: lane 2
|
||||
[0, 1, 2, 3, 8, 9, 10, 11, 16, 16, 16, 16, 16, 16, 16, 16], // 0b0101: lanes 0,2
|
||||
[4, 5, 6, 7, 8, 9, 10, 11, 16, 16, 16, 16, 16, 16, 16, 16], // 0b0110: lanes 1,2
|
||||
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 16, 16, 16, 16], // 0b0111: lanes 0,1,2
|
||||
[
|
||||
12, 13, 14, 15, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
|
||||
], // 0b1000: lane 3
|
||||
[0, 1, 2, 3, 12, 13, 14, 15, 16, 16, 16, 16, 16, 16, 16, 16], // 0b1001: lanes 0,3
|
||||
[4, 5, 6, 7, 12, 13, 14, 15, 16, 16, 16, 16, 16, 16, 16, 16], // 0b1010: lanes 1,3
|
||||
[0, 1, 2, 3, 4, 5, 6, 7, 12, 13, 14, 15, 16, 16, 16, 16], // 0b1011: lanes 0,1,3
|
||||
[8, 9, 10, 11, 12, 13, 14, 15, 16, 16, 16, 16, 16, 16, 16, 16], // 0b1100: lanes 2,3
|
||||
[0, 1, 2, 3, 8, 9, 10, 11, 12, 13, 14, 15, 16, 16, 16, 16], // 0b1101: lanes 0,2,3
|
||||
[4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 16, 16, 16], // 0b1110: lanes 1,2,3
|
||||
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], // 0b1111: all lanes
|
||||
];
|
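The NEON loop above builds a 4-bit mask from per-lane bit weights, takes its popcount as the number of ids kept, and uses the mask to pick a row of `BYTE_SHUFFLE_TABLE`. A scalar model of one 4-lane step may make that easier to follow. `compact4` is a hypothetical name, and the zero fill of unused lanes relies on the table's out-of-range index 16 producing zero bytes in the table lookup.

```rust
/// Scalar model of one 4-lane step (hypothetical helper, not in the crate).
/// Returns (mask, number of ids kept, compacted ids padded with zeros).
fn compact4(vals: [u32; 4], ids: [u32; 4], start: u32, end: u32) -> (u8, usize, [u32; 4]) {
    // Lane k contributes 2^k to the mask when its value is in range.
    const BIT_WEIGHTS: [u8; 4] = [1, 2, 4, 8];
    let mut mask = 0u8;
    let mut out = [0u32; 4];
    let mut len = 0usize;
    for lane in 0..4 {
        if vals[lane] >= start && vals[lane] <= end {
            mask += BIT_WEIGHTS[lane];
            // The j-th matching lane lands in output position j.
            out[len] = ids[lane];
            len += 1;
        }
    }
    // The number of kept ids is the popcount of the mask.
    debug_assert_eq!(len, mask.count_ones() as usize);
    (mask, len, out)
}
```

With values `[3, 20, 7, 5]`, ids `[100, 101, 102, 103]` and range `4..=10`, only lanes 2 and 3 match, so the mask is `0b1100` and the compacted prefix is `[102, 103]`. That corresponds to the table row commented `0b1100: lanes 2,3`.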
||||
@@ -1,260 +0,0 @@
|
||||
use std::ops::RangeInclusive;
|
||||
|
||||
// SVE vector length (in u32 lanes) is not a compile-time constant; query at runtime.
|
||||
// Safe to call only when SVE is confirmed available via is_aarch64_feature_detected!("sve").
|
||||
#[target_feature(enable = "sve")]
|
||||
unsafe fn num_lanes() -> usize {
|
||||
let vl: usize;
|
||||
unsafe {
|
||||
core::arch::asm!(
|
||||
"cntw {vl}",
|
||||
vl = out(reg) vl,
|
||||
options(nostack, nomem, preserves_flags),
|
||||
);
|
||||
}
|
||||
vl
|
||||
}
|
||||
|
||||
// SAFETY: caller must ensure SVE is available (checked via is_aarch64_feature_detected!("sve")).
|
||||
// Unlike NEON, SVE is optional on aarch64 and not guaranteed by the target architecture.
|
||||
pub unsafe fn filter_vec_in_place(range: RangeInclusive<u32>, offset: u32, output: &mut Vec<u32>) {
|
||||
if range.start() > range.end() {
|
||||
output.clear();
|
||||
return;
|
||||
}
|
||||
let vl = unsafe { num_lanes() };
|
||||
let num_words = output.len() / vl;
|
||||
let range_start = *range.start();
|
||||
// Unsigned subtraction trick: val ∈ [lo, hi] ↔ (val - lo) ≤ᵤ (hi - lo).
|
||||
// Values below lo wrap around to large u32, so the single unsigned ≤ excludes them.
|
||||
let range_width = range.end().wrapping_sub(range_start);
|
||||
let mut output_len = unsafe {
|
||||
filter_vec_sve_aux(
|
||||
output.as_ptr(),
|
||||
range_start,
|
||||
range_width,
|
||||
output.as_mut_ptr(),
|
||||
offset,
|
||||
num_words,
|
||||
vl,
|
||||
)
|
||||
};
|
||||
let remainder_start = num_words * vl;
|
||||
for i in remainder_start..output.len() {
|
||||
let val = output[i];
|
||||
output[output_len] = offset + i as u32;
|
||||
output_len += if range.contains(&val) { 1 } else { 0 };
|
||||
}
|
||||
output.truncate(output_len);
|
||||
}
|
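The SVE entry point above relies on the unsigned subtraction trick: `val` lies in `[lo, hi]` exactly when `val - lo` (wrapping) is at most `hi - lo`. It also returns early for an inverted range because the wrapped width would otherwise be huge. A scalar sketch of the same check follows; `in_range_single_compare` is a hypothetical name used only for illustration.

```rust
use std::ops::RangeInclusive;

/// Scalar version of the single-compare range test (hypothetical helper).
fn in_range_single_compare(val: u32, range: &RangeInclusive<u32>) -> bool {
    // Mirrors the early return in the SVE path: an inverted range matches
    // nothing, and its wrapped width would be meaningless.
    if range.start() > range.end() {
        return false;
    }
    let range_width = range.end().wrapping_sub(*range.start());
    // Values below `start` wrap around to a large u32 and fail the test,
    // so one unsigned comparison covers both bounds.
    val.wrapping_sub(*range.start()) <= range_width
}

/// Returns true if the trick agrees with `RangeInclusive::contains` for
/// every sample value.
fn agrees_with_contains(range: RangeInclusive<u32>, samples: &[u32]) -> bool {
    samples
        .iter()
        .all(|&v| in_range_single_compare(v, &range) == range.contains(&v))
}
```

For the range `3..=7`, the value 2 wraps to `u32::MAX` after subtracting 3 and is rejected, while 8 maps to 5, which exceeds the width of 4. The inverted range `10..=5` rejects everything, matching the behavior that `test_filter_impl_empty_range_aux` checks in this diff.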
||||
|
||||
// Register allocation for the asm! blocks:
|
||||
// z0 ids_a (index vector for first half of each pair, advances by step2 each iter)
|
||||
// z1 range_width broadcast
|
||||
// z2 range_start broadcast
|
||||
// z3 step2 broadcast (2 * vl)
|
||||
// z4 ids_b (index vector for second half, = ids_a + step, advances by step2)
|
||||
// z5 scratch: loaded word_a, then compacted_a
|
||||
// z6 scratch: loaded word_b, then compacted_b
|
||||
// p0 all-true predicate (ptrue p0.s)
|
||||
// p1 in-range mask for word_a
|
||||
// p2 in-range mask for word_b
|
||||
#[target_feature(enable = "sve")]
|
||||
unsafe fn filter_vec_sve_aux(
|
||||
input: *const u32,
|
||||
range_start: u32,
|
||||
range_width: u32,
|
||||
output: *mut u32,
|
||||
offset: u32,
|
||||
num_words: usize,
|
||||
vl: usize,
|
||||
) -> usize {
|
||||
let num_pairs = num_words / 2;
|
||||
let mut input_ptr = input;
|
||||
let mut output_tail = output;
|
||||
|
||||
if num_pairs > 0 {
|
||||
unsafe {
|
||||
// We rely on asm! because the SVE intrinsics are not available in stable Rust.
|
||||
// The code that follows was generated by Rustc nightly based on the intrinsics version
|
||||
// at the bottom of this file.
|
||||
core::arch::asm!(
|
||||
// --- Setup ---
|
||||
// All-true predicate for 32-bit lanes.
|
||||
"ptrue p0.s",
|
||||
// ids_a = [offset, offset+1, offset+2, ...]
|
||||
"index z0.s, {offset:w}, #1",
|
||||
// Broadcast scalars into SVE vectors.
|
||||
"mov z1.s, {range_width:w}",
|
||||
"mov z2.s, {range_start:w}",
|
||||
// vl_gpr = number of 32-bit lanes (cntw).
|
||||
"cntw {vl_gpr}",
|
||||
// step2_bytes will first hold 2*vl (for the step2 vector), then 2*VL in bytes.
|
||||
"lsl {step2_bytes}, {vl_gpr}, #1",
|
||||
// z4 = step = [vl, vl, ...]; will become ids_b after the add below.
|
||||
"mov z4.s, {vl_gpr:w}",
|
||||
// z3 = step2 = [2*vl, 2*vl, ...], used to advance both id vectors each iter.
|
||||
"mov z3.s, {step2_bytes:w}",
|
||||
// Repurpose step2_bytes to hold the byte stride for advancing the input pointer
|
||||
// by two full SVE vectors per iteration.
|
||||
"rdvl {step2_bytes}, #2",
|
||||
// ids_b = ids_a + step = [offset+vl, offset+vl+1, ...]
|
||||
"add z4.s, z0.s, z4.s",
|
||||
|
||||
// --- Main loop: process two SVE vectors (ids_a and ids_b) per iteration ---
|
||||
"0:",
|
||||
// Load two consecutive SVE vectors from input.
|
||||
"ld1w {{z5.s}}, p0/z, [{input}]",
|
||||
"ld1w {{z6.s}}, p0/z, [{input}, #1, mul vl]",
|
||||
// Advance input pointer by 2 * VL bytes.
|
||||
"add {input}, {input}, {step2_bytes}",
|
||||
// Unsigned shift: subtract range_start so in-range check becomes a single cmpu ≤.
|
||||
"sub z5.s, z5.s, z2.s",
|
||||
"sub z6.s, z6.s, z2.s",
|
||||
// in_range: shifted value ≤ range_width (unsigned, so values below lo also fail).
|
||||
"cmphs p1.s, p0/z, z1.s, z5.s",
|
||||
"cmphs p2.s, p0/z, z1.s, z6.s",
|
||||
// Count matching lanes; both cntp calls have independent inputs for OOO parallelism.
|
||||
"cntp {cnt_a}, p0, p1.s",
|
||||
"compact z5.s, p1, z0.s",
|
||||
"compact z6.s, p2, z4.s",
|
||||
"cntp {cnt_b}, p0, p2.s",
|
||||
// Advance id vectors for the next iteration.
|
||||
"add z0.s, z0.s, z3.s",
|
||||
"add z4.s, z4.s, z3.s",
|
||||
// Store compacted ids. Only the first cnt_a / cnt_b slots are valid; the rest
|
||||
// will be overwritten by subsequent iterations before the final truncate.
|
||||
"str z5, [{out}]",
|
||||
"st1w {{z6.s}}, p0, [{out}, {cnt_a}, lsl #2]",
|
||||
"add {out}, {out}, {cnt_a}, lsl #2",
|
||||
"add {out}, {out}, {cnt_b}, lsl #2",
|
||||
"subs {pairs}, {pairs}, #1",
|
||||
"b.ne 0b",
|
||||
|
||||
// --- Operands ---
|
||||
input = inout(reg) input_ptr,
|
||||
out = inout(reg) output_tail,
|
||||
pairs = inout(reg) num_pairs => _,
|
||||
offset = in(reg) offset,
|
||||
range_start = in(reg) range_start,
|
||||
range_width = in(reg) range_width,
|
||||
vl_gpr = out(reg) _,
|
||||
step2_bytes = out(reg) _,
|
||||
cnt_a = out(reg) _,
|
||||
cnt_b = out(reg) _,
|
||||
out("p0") _, out("p1") _, out("p2") _,
|
||||
out("v0") _, out("v1") _, out("v2") _, out("v3") _,
|
||||
out("v4") _, out("v5") _, out("v6") _,
|
||||
options(nostack),
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
// Handle an odd trailing vector.
|
||||
if num_words % 2 == 1 {
|
||||
// ids_a for the odd word starts at offset + num_pairs * 2 * vl.
|
||||
// input_ptr was advanced by the main loop and now points at the odd word.
|
||||
let odd_offset =
|
||||
offset.wrapping_add((num_pairs as u32).wrapping_mul(2).wrapping_mul(vl as u32));
|
||||
unsafe {
|
||||
core::arch::asm!(
|
||||
"ptrue p0.s",
|
||||
"index z0.s, {odd_offset:w}, #1",
|
||||
"mov z1.s, {range_width:w}",
|
||||
"mov z2.s, {range_start:w}",
|
||||
"ld1w {{z3.s}}, p0/z, [{input}]",
|
||||
"sub z3.s, z3.s, z2.s",
|
||||
"cmphs p1.s, p0/z, z1.s, z3.s",
|
||||
"cntp {cnt}, p0, p1.s",
|
||||
"compact z0.s, p1, z0.s",
|
||||
"str z0, [{out}]",
|
||||
"add {out}, {out}, {cnt}, lsl #2",
|
||||
odd_offset = in(reg) odd_offset,
|
||||
range_width = in(reg) range_width,
|
||||
range_start = in(reg) range_start,
|
||||
input = in(reg) input_ptr,
|
||||
out = inout(reg) output_tail,
|
||||
cnt = out(reg) _,
|
||||
out("p0") _, out("p1") _,
|
||||
out("v0") _, out("v1") _, out("v2") _, out("v3") _,
|
||||
options(nostack),
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
unsafe { output_tail.offset_from(output) as usize }
|
||||
}
|
||||
|
||||
// SVE implements with intrinsics.
|
||||
//
|
||||
// #[target_feature(enable = "sve")]
|
||||
// unsafe fn filter_vec_sve_aux(
|
||||
// input: *const u32,
|
||||
// range_start: u32,
|
||||
// range_width: u32,
|
||||
// output: *mut u32,
|
||||
// offset: u32,
|
||||
// num_words: usize,
|
||||
// vl: usize,
|
||||
// ) -> usize {
|
||||
// unsafe {
|
||||
// let all_true = svptrue_b32();
|
||||
// let range_start_simd = svdup_n_u32(range_start);
|
||||
// let range_width_simd = svdup_n_u32(range_width);
|
||||
// // ids_a covers [offset .. offset+vl), ids_b covers the next vl ids.
|
||||
// // Keeping them separate breaks the loop-carried dependency through ids so
|
||||
// // both compact/cntp chains are fully independent within each unrolled body.
|
||||
// let mut ids_a = svindex_u32(offset, 1);
|
||||
// let step = svdup_n_u32(vl as u32);
|
||||
// let step2 = svdup_n_u32(2 * vl as u32);
|
||||
// let mut ids_b = svadd_u32_x(all_true, ids_a, step);
|
||||
|
||||
// let mut input = input;
|
||||
// let mut output_tail = output;
|
||||
|
||||
// // Unrolled ×2: both cntp calls have independent inputs and execute in parallel.
|
||||
// // The two output_tail updates are sequential but together cost 4+1+1=6 cy per
|
||||
// // pair vs 5+5=10 cy for two scalar iterations, breaking the cntp latency chain.
|
||||
// let num_pairs = num_words / 2;
|
||||
// for _ in 0..num_pairs {
|
||||
// let word_a = svld1_u32(all_true, input);
|
||||
// let word_b = svld1_u32(all_true, input.add(vl));
|
||||
|
||||
// let shifted_a = svsub_u32_x(all_true, word_a, range_start_simd);
|
||||
// let shifted_b = svsub_u32_x(all_true, word_b, range_start_simd);
|
||||
|
||||
// let in_range_a = svcmple_u32(all_true, shifted_a, range_width_simd);
|
||||
// let in_range_b = svcmple_u32(all_true, shifted_b, range_width_simd);
|
||||
|
||||
// let compacted_a = svcompact_u32(in_range_a, ids_a);
|
||||
// let compacted_b = svcompact_u32(in_range_b, ids_b);
|
||||
// // cntp_a and cntp_b have independent inputs: OOO engine issues them in parallel.
|
||||
// let added_len_a = svcntp_b32(all_true, in_range_a) as usize;
|
||||
// let added_len_b = svcntp_b32(all_true, in_range_b) as usize;
|
||||
|
||||
// // Write the full vector — only the first added_len slots are valid.
|
||||
// // Subsequent iterations overwrite the trailing zeros before truncate.
|
||||
// svst1_u32(all_true, output_tail, compacted_a);
|
||||
// output_tail = output_tail.add(added_len_a);
|
||||
// svst1_u32(all_true, output_tail, compacted_b);
|
||||
// output_tail = output_tail.add(added_len_b);
|
||||
|
||||
// ids_a = svadd_u32_x(all_true, ids_a, step2);
|
||||
// ids_b = svadd_u32_x(all_true, ids_b, step2);
|
||||
// input = input.add(2 * vl);
|
||||
// }
|
||||
|
||||
// // Handle an odd trailing word.
|
||||
// if num_words % 2 == 1 {
|
||||
// let word = svld1_u32(all_true, input);
|
||||
// let shifted = svsub_u32_x(all_true, word, range_start_simd);
|
||||
// let in_range = svcmple_u32(all_true, shifted, range_width_simd);
|
||||
// let added_len = svcntp_b32(all_true, in_range) as usize;
|
||||
// let compacted_ids = svcompact_u32(in_range, ids_a);
|
||||
// svst1_u32(all_true, output_tail, compacted_ids);
|
||||
// output_tail = output_tail.add(added_len);
|
||||
// }
|
||||
|
||||
// output_tail.offset_from(output) as usize
|
||||
// }
|
||||
// }
|
||||
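The commented-out SVE kernel above tests range membership with a single subtract and a single unsigned compare. As a minimal scalar sketch of that same trick (the name `filter_ids_in_range` is hypothetical and not part of tantivy, and it assumes `range_start + range_width` does not overflow `u32`):

```rust
/// Scalar model of the range test used by the vectorized filter.
///
/// A value `v` lies in `[range_start, range_start + range_width]` exactly when
/// `v.wrapping_sub(range_start) <= range_width` in unsigned arithmetic:
/// values below `range_start` wrap around to very large numbers and fail the test.
///
/// Hypothetical helper for illustration only; assumes
/// `range_start + range_width` fits in a `u32`.
pub fn filter_ids_in_range(
    input: &[u32],
    range_start: u32,
    range_width: u32,
    offset: u32,
    output: &mut Vec<u32>,
) {
    for (i, &value) in input.iter().enumerate() {
        // Same role as `svsub_u32_x` followed by `svcmple_u32` in the kernel.
        let shifted = value.wrapping_sub(range_start);
        if shifted <= range_width {
            // The kernel emits the id (position + offset), not the value itself.
            output.push(offset + i as u32);
        }
    }
}
```

The vectorized version gets the same result by compacting an id vector under the comparison predicate instead of branching per element.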
@@ -1,6 +1,6 @@
|
||||
[package]
|
||||
name = "tantivy-columnar"
|
||||
version = "0.7.0"
|
||||
version = "0.6.0"
|
||||
edition = "2024"
|
||||
license = "MIT"
|
||||
homepage = "https://github.com/quickwit-oss/tantivy"
|
||||
@@ -12,18 +12,18 @@ categories = ["database-implementations", "data-structures", "compression"]
|
||||
itertools = "0.14.0"
|
||||
fastdivide = "0.4.0"
|
||||
|
||||
stacker = { version= "0.7", path = "../stacker", package="tantivy-stacker"}
|
||||
sstable = { version= "0.7", path = "../sstable", package = "tantivy-sstable" }
|
||||
common = { version= "0.11", path = "../common", package = "tantivy-common" }
|
||||
tantivy-bitpacker = { version= "0.10", path = "../bitpacker/" }
|
||||
stacker = { version= "0.6", path = "../stacker", package="tantivy-stacker"}
|
||||
sstable = { version= "0.6", path = "../sstable", package = "tantivy-sstable" }
|
||||
common = { version= "0.10", path = "../common", package = "tantivy-common" }
|
||||
tantivy-bitpacker = { version= "0.9", path = "../bitpacker/" }
|
||||
serde = "1.0.152"
|
||||
downcast-rs = "2.0.1"
|
||||
|
||||
[dev-dependencies]
|
||||
proptest = "1"
|
||||
more-asserts = "0.3.1"
|
||||
rand = "0.9"
|
||||
binggan = "0.17.0"
|
||||
rand = "0.8"
|
||||
binggan = "0.14.0"
|
||||
|
||||
[[bench]]
|
||||
name = "bench_merge"
|
||||
|
||||
@@ -73,7 +73,7 @@ The crate introduces the following concepts.
|
||||
`Columnar` is an equivalent of a dataframe.
|
||||
It maps `column_key` to `Column`.
|
||||
|
||||
A `Column<T>` associates a `RowId` (u32) to any
|
||||
A `Column<T>` asssociates a `RowId` (u32) to any
|
||||
number of values.
|
||||
|
||||
This is made possible by wrapping a `ColumnIndex` and a `ColumnValue` object.
|
||||
|
||||
@@ -9,7 +9,7 @@ use tantivy_columnar::column_values::{CodecType, serialize_and_load_u64_based_co
|
||||
fn get_data() -> Vec<u64> {
|
||||
let mut rng = StdRng::seed_from_u64(2u64);
|
||||
let mut data: Vec<_> = (100..55_000_u64)
|
||||
.map(|num| num + rng.random::<u8>() as u64)
|
||||
.map(|num| num + rng.r#gen::<u8>() as u64)
|
||||
.collect();
|
||||
data.push(99_000);
|
||||
data.insert(1000, 2000);
|
||||
|
||||
@@ -6,7 +6,7 @@ use tantivy_columnar::column_values::{CodecType, serialize_u64_based_column_valu
|
||||
fn get_data() -> Vec<u64> {
|
||||
let mut rng = StdRng::seed_from_u64(2u64);
|
||||
let mut data: Vec<_> = (100..55_000_u64)
|
||||
.map(|num| num + rng.random::<u8>() as u64)
|
||||
.map(|num| num + rng.r#gen::<u8>() as u64)
|
||||
.collect();
|
||||
data.push(99_000);
|
||||
data.insert(1000, 2000);
|
||||
|
||||
@@ -28,7 +28,7 @@ fn get_test_columns() -> Columns {
|
||||
}
|
||||
let mut buffer: Vec<u8> = Vec::new();
|
||||
dataframe_writer
|
||||
.serialize(data.len() as u32, None, &mut buffer)
|
||||
.serialize(data.len() as u32, &mut buffer)
|
||||
.unwrap();
|
||||
let columnar = ColumnarReader::open(buffer).unwrap();
|
||||
|
||||
@@ -89,6 +89,13 @@ fn main() {
|
||||
black_box(sum);
|
||||
});
|
||||
|
||||
group.register("first_block_fetch", |column| {
|
||||
let mut block: Vec<Option<u64>> = vec![None; 64];
|
||||
let fetch_docids = (0..64).collect::<Vec<_>>();
|
||||
column.first_vals(&fetch_docids, &mut block);
|
||||
black_box(block[0]);
|
||||
});
|
||||
|
||||
group.register("first_block_single_calls", |column| {
|
||||
let mut block: Vec<Option<u64>> = vec![None; 64];
|
||||
let fetch_docids = (0..64).collect::<Vec<_>>();
|
||||
|
||||
@@ -8,7 +8,7 @@ const TOTAL_NUM_VALUES: u32 = 1_000_000;
|
||||
fn gen_optional_index(fill_ratio: f64) -> OptionalIndex {
|
||||
let mut rng: StdRng = StdRng::from_seed([1u8; 32]);
|
||||
let vals: Vec<u32> = (0..TOTAL_NUM_VALUES)
|
||||
.map(|_| rng.random_bool(fill_ratio))
|
||||
.map(|_| rng.gen_bool(fill_ratio))
|
||||
.enumerate()
|
||||
.filter(|(_pos, val)| *val)
|
||||
.map(|(pos, _)| pos as u32)
|
||||
@@ -25,7 +25,7 @@ fn random_range_iterator(
|
||||
let mut rng: StdRng = StdRng::from_seed([1u8; 32]);
|
||||
let mut current = start;
|
||||
std::iter::from_fn(move || {
|
||||
current += rng.random_range(avg_step_size - avg_deviation..=avg_step_size + avg_deviation);
|
||||
current += rng.gen_range(avg_step_size - avg_deviation..=avg_step_size + avg_deviation);
|
||||
if current >= end { None } else { Some(current) }
|
||||
})
|
||||
}
|
||||
|
||||
@@ -39,7 +39,7 @@ fn get_data_50percent_item() -> Vec<u128> {
|
||||
|
||||
let mut data = vec![];
|
||||
for _ in 0..300_000 {
|
||||
let val = rng.random_range(1..=100);
|
||||
let val = rng.gen_range(1..=100);
|
||||
data.push(val);
|
||||
}
|
||||
data.push(SINGLE_ITEM);
|
||||
|
||||
@@ -34,7 +34,7 @@ fn get_data_50percent_item() -> Vec<u128> {
|
||||
|
||||
let mut data = vec![];
|
||||
for _ in 0..300_000 {
|
||||
let val = rng.random_range(1..=100);
|
||||
let val = rng.gen_range(1..=100);
|
||||
data.push(val);
|
||||
}
|
||||
data.push(SINGLE_ITEM);
|
||||
|
||||
@@ -54,6 +54,6 @@ pub fn generate_columnar_with_name(card: Card, num_docs: u32, column_name: &str)
|
||||
}
|
||||
|
||||
let mut wrt: Vec<u8> = Vec::new();
|
||||
columnar_writer.serialize(num_docs, None, &mut wrt).unwrap();
|
||||
columnar_writer.serialize(num_docs, &mut wrt).unwrap();
|
||||
ColumnarReader::open(wrt).unwrap()
|
||||
}
|
||||
|
||||
@@ -15,37 +15,9 @@ impl<T: PartialOrd + Copy + std::fmt::Debug + Send + Sync + 'static + Default>
|
||||
{
|
||||
#[inline]
|
||||
pub fn fetch_block<'a>(&'a mut self, docs: &'a [u32], accessor: &Column<T>) {
|
||||
self.fetch_block_with_is_full(docs, accessor, accessor.index.get_cardinality().is_full());
|
||||
}
|
||||
|
||||
/// Like [`Self::fetch_block`] but takes the column's fullness instead of querying
|
||||
/// `accessor.index.get_cardinality()` each call — for callers that know it up front (e.g.
|
||||
/// checked once at construction). `is_full` must equal
|
||||
/// `accessor.index.get_cardinality().is_full()`.
|
||||
#[inline]
|
||||
pub fn fetch_block_with_is_full<'a>(
|
||||
&'a mut self,
|
||||
docs: &'a [u32],
|
||||
accessor: &Column<T>,
|
||||
is_full: bool,
|
||||
) {
|
||||
if is_full {
|
||||
// Skip the resize when already the right length (common case: fixed-size blocks).
|
||||
if self.val_cache.len() != docs.len() {
|
||||
self.val_cache.resize(docs.len(), T::default());
|
||||
}
|
||||
// When the docs form a contiguous ascending run we can fetch the values
|
||||
// as a single range. This lets codecs (e.g. bitpacked) bulk-decode the
|
||||
// slice instead of gathering value-by-value, and avoids per-value dynamic
|
||||
// dispatch. `docs` is always sorted ascending and free of duplicates here,
|
||||
// so comparing the endpoints is enough to detect contiguity.
|
||||
if is_contiguous(docs) {
|
||||
accessor
|
||||
.values
|
||||
.get_range(docs[0] as u64, &mut self.val_cache);
|
||||
} else {
|
||||
accessor.values.get_vals(docs, &mut self.val_cache);
|
||||
}
|
||||
if accessor.index.get_cardinality().is_full() {
|
||||
self.val_cache.resize(docs.len(), T::default());
|
||||
accessor.values.get_vals(docs, &mut self.val_cache);
|
||||
} else {
|
||||
self.docid_cache.clear();
|
||||
self.row_id_cache.clear();
|
||||
@@ -57,20 +29,12 @@ impl<T: PartialOrd + Copy + std::fmt::Debug + Send + Sync + 'static + Default>
|
||||
}
|
||||
}
|
||||
#[inline]
|
||||
pub fn fetch_block_with_missing(
|
||||
&mut self,
|
||||
docs: &[u32],
|
||||
accessor: &Column<T>,
|
||||
missing_opt: Option<T>,
|
||||
) {
|
||||
pub fn fetch_block_with_missing(&mut self, docs: &[u32], accessor: &Column<T>, missing: T) {
|
||||
self.fetch_block(docs, accessor);
|
||||
// no missing values
|
||||
if accessor.index.get_cardinality().is_full() {
|
||||
return;
|
||||
}
|
||||
let Some(missing) = missing_opt else {
|
||||
return;
|
||||
};
|
||||
|
||||
// We can compare docid_cache length with docs to find missing docs
|
||||
// For multi value columns we can't rely on the length and always need to scan
|
||||
@@ -86,78 +50,6 @@ impl<T: PartialOrd + Copy + std::fmt::Debug + Send + Sync + 'static + Default>
|
||||
}
|
||||
}
|
||||
|
||||
/// Like `fetch_block_with_missing`, but deduplicates (doc_id, value) pairs
|
||||
/// so that each unique value per document is returned only once.
|
||||
///
|
||||
/// This is necessary for correct document counting in aggregations,
|
||||
/// where multi-valued fields can produce duplicate entries that inflate counts.
|
||||
#[inline]
|
||||
pub fn fetch_block_with_missing_unique_per_doc(
|
||||
&mut self,
|
||||
docs: &[u32],
|
||||
accessor: &Column<T>,
|
||||
missing: Option<T>,
|
||||
) where
|
||||
T: Ord,
|
||||
{
|
||||
self.fetch_block_with_missing(docs, accessor, missing);
|
||||
if accessor.index.get_cardinality().is_multivalue() {
|
||||
self.dedup_docid_val_pairs();
|
||||
}
|
||||
}
|
||||
|
||||
/// Removes duplicate (doc_id, value) pairs from the caches.
|
||||
///
|
||||
/// After `fetch_block`, entries are sorted by doc_id, but values within
|
||||
/// the same doc may not be sorted (e.g. `(0,1), (0,2), (0,1)`).
|
||||
/// We group consecutive entries by doc_id, sort values within each group
|
||||
/// if it has more than 2 elements, then deduplicate adjacent pairs.
|
||||
///
|
||||
/// Skips entirely if no doc_id appears more than once in the block.
|
||||
fn dedup_docid_val_pairs(&mut self)
|
||||
where T: Ord {
|
||||
if self.docid_cache.len() <= 1 {
|
||||
return;
|
||||
}
|
||||
|
||||
// Quick check: if no consecutive doc_ids are equal, no dedup needed.
|
||||
let has_multivalue = self.docid_cache.windows(2).any(|w| w[0] == w[1]);
|
||||
if !has_multivalue {
|
||||
return;
|
||||
}
|
||||
|
||||
// Sort values within each doc_id group so duplicates become adjacent.
|
||||
let mut start = 0;
|
||||
while start < self.docid_cache.len() {
|
||||
let doc = self.docid_cache[start];
|
||||
let mut end = start + 1;
|
||||
while end < self.docid_cache.len() && self.docid_cache[end] == doc {
|
||||
end += 1;
|
||||
}
|
||||
if end - start > 2 {
|
||||
self.val_cache[start..end].sort();
|
||||
}
|
||||
start = end;
|
||||
}
|
||||
|
||||
// Now duplicates are adjacent — deduplicate in place.
|
||||
let mut write = 0;
|
||||
for read in 1..self.docid_cache.len() {
|
||||
if self.docid_cache[read] != self.docid_cache[write]
|
||||
|| self.val_cache[read] != self.val_cache[write]
|
||||
{
|
||||
write += 1;
|
||||
if write != read {
|
||||
self.docid_cache[write] = self.docid_cache[read];
|
||||
self.val_cache[write] = self.val_cache[read];
|
||||
}
|
||||
}
|
||||
}
|
||||
let new_len = write + 1;
|
||||
self.docid_cache.truncate(new_len);
|
||||
self.val_cache.truncate(new_len);
|
||||
}
|
||||
|
||||
#[inline]
|
||||
pub fn iter_vals(&self) -> impl Iterator<Item = T> + '_ {
|
||||
self.val_cache.iter().cloned()
|
||||
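The `dedup_docid_val_pairs` routine in this hunk works on two parallel caches. A standalone sketch of the same idea on plain vectors, assuming `doc_ids` is sorted ascending and both vectors have equal length (the free function `dedup_pairs` is hypothetical, not tantivy's API):

```rust
/// Removes duplicate `(doc_id, value)` pairs from two parallel vectors.
///
/// Assumes `doc_ids` is sorted ascending and `vals.len() == doc_ids.len()`.
/// Hypothetical free-function version for illustration only.
pub fn dedup_pairs(doc_ids: &mut Vec<u32>, vals: &mut Vec<u64>) {
    assert_eq!(doc_ids.len(), vals.len());
    if doc_ids.len() <= 1 {
        return;
    }
    // Sort values inside each run of equal doc ids so duplicates become adjacent.
    let mut start = 0;
    while start < doc_ids.len() {
        let doc = doc_ids[start];
        let mut end = start + 1;
        while end < doc_ids.len() && doc_ids[end] == doc {
            end += 1;
        }
        // With one or two values, any duplicates are already adjacent.
        if end - start > 2 {
            vals[start..end].sort_unstable();
        }
        start = end;
    }
    // Compact in place, keeping the first pair of every run of equal pairs.
    let mut write = 0;
    for read in 1..doc_ids.len() {
        if doc_ids[read] != doc_ids[write] || vals[read] != vals[write] {
            write += 1;
            doc_ids[write] = doc_ids[read];
            vals[write] = vals[read];
        }
    }
    doc_ids.truncate(write + 1);
    vals.truncate(write + 1);
}
```

Sorting only within a document's run keeps the doc-id order intact, which is why the final pass can compare neighbours alone.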
@@ -186,22 +78,6 @@ impl<T: PartialOrd + Copy + std::fmt::Debug + Send + Sync + 'static + Default>
|
||||
}
|
||||
}
|
||||
|
||||
/// Returns true if `docs` is a contiguous ascending run `[d, d + 1, ..., d + n - 1]`.
|
||||
///
|
||||
/// Assumes `docs` is sorted ascending and free of duplicates (the invariant for the
|
||||
/// doc blocks passed to `fetch_block`), so comparing the endpoints is sufficient.
|
||||
#[inline]
|
||||
fn is_contiguous(docs: &[u32]) -> bool {
|
||||
let (Some(&first), Some(&last)) = (docs.first(), docs.last()) else {
|
||||
return false;
|
||||
};
|
||||
debug_assert!(
|
||||
docs.windows(2).all(|w| w[0] < w[1]),
|
||||
"fetch_block requires docs sorted ascending without duplicates"
|
||||
);
|
||||
(last - first) as usize + 1 == docs.len()
|
||||
}
|
||||
|
||||
/// Given two sorted lists of docids `docs` and `hits`, hits is a subset of `docs`.
|
||||
/// Return all docs that are not in `hits`.
|
||||
fn find_missing_docs<F>(docs: &[u32], hits: &[u32], mut callback: F)
|
||||
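The endpoint comparison in `is_contiguous` is sufficient because a strictly ascending slice of length `n` that spans exactly `n` consecutive integers cannot skip any of them. A rough sketch of how that check can select between a range copy and a per-document gather, using a plain slice in place of a column (`is_contiguous_run` and `fetch` are hypothetical names):

```rust
/// Returns true if `docs` is `[d, d + 1, ..., d + n - 1]`.
/// Assumes `docs` is strictly ascending; only the endpoints are inspected.
pub fn is_contiguous_run(docs: &[u32]) -> bool {
    match (docs.first(), docs.last()) {
        (Some(&first), Some(&last)) => (last - first) as usize + 1 == docs.len(),
        _ => false,
    }
}

/// Fetches `values[doc]` for every doc in `docs`.
/// Hypothetical stand-in for a column: `values` is indexed by doc id.
pub fn fetch(values: &[u64], docs: &[u32], out: &mut Vec<u64>) {
    out.clear();
    if is_contiguous_run(docs) {
        // Fast path: one contiguous copy instead of one lookup per doc.
        let start = docs[0] as usize;
        out.extend_from_slice(&values[start..start + docs.len()]);
    } else {
        // Gather path: look up each doc individually.
        out.extend(docs.iter().map(|&doc| values[doc as usize]));
    }
}
```

Both paths must return the same values; the fast path only changes how they are read.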
@@ -235,7 +111,6 @@ where F: FnMut(u32) {
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
#[allow(clippy::field_reassign_with_default)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
@@ -280,98 +155,4 @@ mod tests {
|
||||
|
||||
assert_eq!(missing_docs, vec![1, 2, 3, 4, 5]);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_dedup_docid_val_pairs_consecutive() {
|
||||
let mut accessor = ColumnBlockAccessor::<u64>::default();
|
||||
accessor.docid_cache = vec![0, 0, 2, 3];
|
||||
accessor.val_cache = vec![10, 10, 10, 10];
|
||||
accessor.dedup_docid_val_pairs();
|
||||
assert_eq!(accessor.docid_cache, vec![0, 2, 3]);
|
||||
assert_eq!(accessor.val_cache, vec![10, 10, 10]);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_dedup_docid_val_pairs_non_consecutive() {
|
||||
// (0,1), (0,2), (0,1) — duplicate value not adjacent
|
||||
let mut accessor = ColumnBlockAccessor::<u64>::default();
|
||||
accessor.docid_cache = vec![0, 0, 0];
|
||||
accessor.val_cache = vec![1, 2, 1];
|
||||
accessor.dedup_docid_val_pairs();
|
||||
assert_eq!(accessor.docid_cache, vec![0, 0]);
|
||||
assert_eq!(accessor.val_cache, vec![1, 2]);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_dedup_docid_val_pairs_multi_doc() {
|
||||
// doc 0: values [3, 1, 3], doc 1: values [5, 5]
|
||||
let mut accessor = ColumnBlockAccessor::<u64>::default();
|
||||
accessor.docid_cache = vec![0, 0, 0, 1, 1];
|
||||
accessor.val_cache = vec![3, 1, 3, 5, 5];
|
||||
accessor.dedup_docid_val_pairs();
|
||||
assert_eq!(accessor.docid_cache, vec![0, 0, 1]);
|
||||
assert_eq!(accessor.val_cache, vec![1, 3, 5]);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_dedup_docid_val_pairs_no_duplicates() {
|
||||
let mut accessor = ColumnBlockAccessor::<u64>::default();
|
||||
accessor.docid_cache = vec![0, 0, 1];
|
||||
accessor.val_cache = vec![1, 2, 3];
|
||||
accessor.dedup_docid_val_pairs();
|
||||
assert_eq!(accessor.docid_cache, vec![0, 0, 1]);
|
||||
assert_eq!(accessor.val_cache, vec![1, 2, 3]);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_dedup_docid_val_pairs_single_element() {
|
||||
let mut accessor = ColumnBlockAccessor::<u64>::default();
|
||||
accessor.docid_cache = vec![0];
|
||||
accessor.val_cache = vec![1];
|
||||
accessor.dedup_docid_val_pairs();
|
||||
assert_eq!(accessor.docid_cache, vec![0]);
|
||||
assert_eq!(accessor.val_cache, vec![1]);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_is_contiguous() {
|
||||
assert!(!is_contiguous(&[]));
|
||||
assert!(is_contiguous(&[5]));
|
||||
assert!(is_contiguous(&[5, 6, 7, 8]));
|
||||
assert!(is_contiguous(&[0, 1, 2]));
|
||||
assert!(!is_contiguous(&[5, 7, 8]));
|
||||
assert!(!is_contiguous(&[0, 1, 3]));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_fetch_block_contiguous_and_gather_match() {
|
||||
use crate::column_index::ColumnIndex;
|
||||
use crate::column_values::{
|
||||
ALL_U64_CODEC_TYPES, serialize_and_load_u64_based_column_values,
|
||||
};
|
||||
|
||||
let vals: Vec<u64> = (0..200u64).map(|i| i * 7 + 3).collect();
|
||||
let values =
|
||||
serialize_and_load_u64_based_column_values::<u64>(&&vals[..], &ALL_U64_CODEC_TYPES);
|
||||
let column = Column {
|
||||
index: ColumnIndex::Full,
|
||||
values,
|
||||
};
|
||||
|
||||
let check = |accessor: &mut ColumnBlockAccessor<u64>, docs: &[u32]| {
|
||||
accessor.fetch_block(docs, &column);
|
||||
let got: Vec<(u32, u64)> = accessor.iter_docid_vals(docs, &column).collect();
|
||||
let expected: Vec<(u32, u64)> = docs.iter().map(|&d| (d, vals[d as usize])).collect();
|
||||
assert_eq!(got, expected);
|
||||
};
|
||||
|
||||
let mut accessor = ColumnBlockAccessor::<u64>::default();
|
||||
// Contiguous block -> get_range fast path.
|
||||
check(&mut accessor, &(10..74).collect::<Vec<u32>>());
|
||||
// Non-contiguous block -> get_vals gather path.
|
||||
check(&mut accessor, &[0, 5, 9, 100, 199]);
|
||||
// Single doc and full span.
|
||||
check(&mut accessor, &[42]);
|
||||
check(&mut accessor, &(0..200).collect::<Vec<u32>>());
|
||||
}
|
||||
}
|
||||
|
||||
@@ -85,8 +85,8 @@ impl<T: PartialOrd + Copy + Debug + Send + Sync + 'static> Column<T> {
|
||||
}
|
||||
|
||||
#[inline]
|
||||
pub fn first(&self, doc_id: DocId) -> Option<T> {
|
||||
self.values_for_doc(doc_id).next()
|
||||
pub fn first(&self, row_id: RowId) -> Option<T> {
|
||||
self.values_for_doc(row_id).next()
|
||||
}
|
||||
|
||||
/// Load the first value for each docid in the provided slice.
|
||||
@@ -131,8 +131,6 @@ impl<T: PartialOrd + Copy + Debug + Send + Sync + 'static> Column<T> {
|
||||
self.index.docids_to_rowids(doc_ids, doc_ids_out, row_ids)
|
||||
}
|
||||
|
||||
/// Get an iterator over the values for the provided docid.
|
||||
#[inline]
|
||||
pub fn values_for_doc(&self, doc_id: DocId) -> impl Iterator<Item = T> + '_ {
|
||||
self.index
|
||||
.value_row_ids(doc_id)
|
||||
@@ -160,6 +158,15 @@ impl<T: PartialOrd + Copy + Debug + Send + Sync + 'static> Column<T> {
|
||||
.select_batch_in_place(selected_docid_range.start, doc_ids);
|
||||
}
|
||||
|
||||
/// Fills the output vector with the (possibly multiple) values that are associated with
|
||||
/// `row_id`.
|
||||
///
|
||||
/// This method clears the `output` vector.
|
||||
pub fn fill_vals(&self, row_id: RowId, output: &mut Vec<T>) {
|
||||
output.clear();
|
||||
output.extend(self.values_for_doc(row_id));
|
||||
}
|
||||
|
||||
pub fn first_or_default_col(self, default_value: T) -> Arc<dyn ColumnValues<T>> {
|
||||
Arc::new(FirstValueWithDefault {
|
||||
column: self,
|
||||
|
||||
@@ -56,7 +56,7 @@ fn get_doc_ids_with_values<'a>(
|
||||
ColumnIndex::Full => Box::new(doc_range),
|
||||
ColumnIndex::Optional(optional_index) => Box::new(
|
||||
optional_index
|
||||
.iter_non_null_docs()
|
||||
.iter_docs()
|
||||
.map(move |row| row + doc_range.start),
|
||||
),
|
||||
ColumnIndex::Multivalued(multivalued_index) => match multivalued_index {
|
||||
@@ -73,7 +73,7 @@ fn get_doc_ids_with_values<'a>(
|
||||
MultiValueIndex::MultiValueIndexV2(multivalued_index) => Box::new(
|
||||
multivalued_index
|
||||
.optional_index
|
||||
.iter_non_null_docs()
|
||||
.iter_docs()
|
||||
.map(move |row| row + doc_range.start),
|
||||
),
|
||||
},
|
||||
@@ -105,11 +105,10 @@ fn get_num_values_iterator<'a>(
|
||||
) -> Box<dyn Iterator<Item = u32> + 'a> {
|
||||
match column_index {
|
||||
ColumnIndex::Empty { .. } => Box::new(std::iter::empty()),
|
||||
ColumnIndex::Full => Box::new(std::iter::repeat_n(1u32, num_docs as usize)),
|
||||
ColumnIndex::Optional(optional_index) => Box::new(std::iter::repeat_n(
|
||||
1u32,
|
||||
optional_index.num_non_nulls() as usize,
|
||||
)),
|
||||
ColumnIndex::Full => Box::new(std::iter::repeat(1u32).take(num_docs as usize)),
|
||||
ColumnIndex::Optional(optional_index) => {
|
||||
Box::new(std::iter::repeat(1u32).take(optional_index.num_non_nulls() as usize))
|
||||
}
|
||||
ColumnIndex::Multivalued(multivalued_index) => Box::new(
|
||||
multivalued_index
|
||||
.get_start_index_column()
|
||||
@@ -178,7 +177,7 @@ impl<'a> Iterable<RowId> for StackedOptionalIndex<'a> {
|
||||
ColumnIndex::Full => Box::new(columnar_row_range),
|
||||
ColumnIndex::Optional(optional_index) => Box::new(
|
||||
optional_index
|
||||
.iter_non_null_docs()
|
||||
.iter_docs()
|
||||
.map(move |row_id: RowId| columnar_row_range.start + row_id),
|
||||
),
|
||||
ColumnIndex::Multivalued(_) => {
|
||||
|
||||
@@ -215,32 +215,6 @@ impl MultiValueIndex {
|
||||
}
|
||||
}
|
||||
|
||||
/// Returns an iterator over document ids that have at least one value.
|
||||
pub fn iter_non_null_docs(&self) -> Box<dyn Iterator<Item = DocId> + '_> {
|
||||
match self {
|
||||
MultiValueIndex::MultiValueIndexV1(idx) => {
|
||||
let mut doc: DocId = 0u32;
|
||||
let num_docs = idx.num_docs();
|
||||
Box::new(std::iter::from_fn(move || {
|
||||
// This is not the most efficient way to do this, but it's legacy code.
|
||||
while doc < num_docs {
|
||||
let cur = doc;
|
||||
doc += 1;
|
||||
let start = idx.start_index_column.get_val(cur);
|
||||
let end = idx.start_index_column.get_val(cur + 1);
|
||||
if end > start {
|
||||
return Some(cur);
|
||||
}
|
||||
}
|
||||
None
|
||||
}))
|
||||
}
|
||||
MultiValueIndex::MultiValueIndexV2(idx) => {
|
||||
Box::new(idx.optional_index.iter_non_null_docs())
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// Converts a list of ranks (row ids of values) in a 1:n index to the corresponding list of
|
||||
/// docids. Positions are converted inplace to docids.
|
||||
///
|
||||
@@ -375,7 +349,7 @@ mod tests {
|
||||
columnar_writer.record_numerical(5, "full", u64::MAX);
|
||||
|
||||
let mut wrt: Vec<u8> = Vec::new();
|
||||
columnar_writer.serialize(7, None, &mut wrt).unwrap();
|
||||
columnar_writer.serialize(7, &mut wrt).unwrap();
|
||||
|
||||
let reader = ColumnarReader::open(wrt).unwrap();
|
||||
// Open the column as u64
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
use std::io;
|
||||
use std::io::{self, Write};
|
||||
use std::sync::Arc;
|
||||
|
||||
mod set;
|
||||
@@ -11,7 +11,7 @@ use set_block::{
|
||||
};
|
||||
|
||||
use crate::iterable::Iterable;
|
||||
use crate::{DocId, RowId};
|
||||
use crate::{DocId, InvalidData, RowId};
|
||||
|
||||
/// The threshold for the number of elements after which we switch to dense block encoding.
|
||||
///
|
||||
@@ -88,7 +88,7 @@ pub struct OptionalIndex {
|
||||
|
||||
impl Iterable<u32> for &OptionalIndex {
|
||||
fn boxed_iter(&self) -> Box<dyn Iterator<Item = u32> + '_> {
|
||||
Box::new(self.iter_non_null_docs())
|
||||
Box::new(self.iter_docs())
|
||||
}
|
||||
}
|
||||
|
||||
@@ -280,9 +280,8 @@ impl OptionalIndex {
|
||||
self.num_non_null_docs
|
||||
}
|
||||
|
||||
pub fn iter_non_null_docs(&self) -> impl Iterator<Item = RowId> + '_ {
|
||||
// TODO optimize. We could iterate over the blocks directly.
|
||||
// We use the dense value ids and retrieve the doc ids via select.
|
||||
pub fn iter_docs(&self) -> impl Iterator<Item = RowId> + '_ {
|
||||
// TODO optimize
|
||||
let mut select_batch = self.select_cursor();
|
||||
(0..self.num_non_null_docs).map(move |rank| select_batch.select(rank))
|
||||
}
|
||||
@@ -335,6 +334,38 @@ enum Block<'a> {
|
||||
Sparse(SparseBlock<'a>),
|
||||
}
|
||||
|
||||
#[derive(Debug, Copy, Clone)]
|
||||
enum OptionalIndexCodec {
|
||||
Dense = 0,
|
||||
Sparse = 1,
|
||||
}
|
||||
|
||||
impl OptionalIndexCodec {
|
||||
fn to_code(self) -> u8 {
|
||||
self as u8
|
||||
}
|
||||
|
||||
fn try_from_code(code: u8) -> Result<Self, InvalidData> {
|
||||
match code {
|
||||
0 => Ok(Self::Dense),
|
||||
1 => Ok(Self::Sparse),
|
||||
_ => Err(InvalidData),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl BinarySerializable for OptionalIndexCodec {
|
||||
fn serialize<W: Write + ?Sized>(&self, writer: &mut W) -> io::Result<()> {
|
||||
writer.write_all(&[self.to_code()])
|
||||
}
|
||||
|
||||
fn deserialize<R: io::Read>(reader: &mut R) -> io::Result<Self> {
|
||||
let optional_codec_code = u8::deserialize(reader)?;
|
||||
let optional_codec = Self::try_from_code(optional_codec_code)?;
|
||||
Ok(optional_codec)
|
||||
}
|
||||
}
|
||||
|
||||
fn serialize_optional_index_block(block_els: &[u16], out: &mut impl io::Write) -> io::Result<()> {
|
||||
let is_sparse = is_sparse(block_els.len() as u32);
|
||||
if is_sparse {
|
||||
|
||||
@@ -15,9 +15,7 @@ fn test_optional_index_with_num_docs(num_docs: u32) {
|
||||
let mut dataframe_writer = ColumnarWriter::default();
|
||||
dataframe_writer.record_numerical(100, "score", 80i64);
|
||||
let mut buffer: Vec<u8> = Vec::new();
|
||||
dataframe_writer
|
||||
.serialize(num_docs, None, &mut buffer)
|
||||
.unwrap();
|
||||
dataframe_writer.serialize(num_docs, &mut buffer).unwrap();
|
||||
let columnar = ColumnarReader::open(buffer).unwrap();
|
||||
assert_eq!(columnar.num_columns(), 1);
|
||||
let cols: Vec<DynamicColumnHandle> = columnar.read_columns("score").unwrap();
|
||||
@@ -166,11 +164,7 @@ fn test_optional_index_large() {
|
||||
fn test_optional_index_iter_aux(row_ids: &[RowId], num_rows: RowId) {
|
||||
let optional_index = OptionalIndex::for_test(num_rows, row_ids);
|
||||
assert_eq!(optional_index.num_docs(), num_rows);
|
||||
assert!(
|
||||
optional_index
|
||||
.iter_non_null_docs()
|
||||
.eq(row_ids.iter().copied())
|
||||
);
|
||||
assert!(optional_index.iter_docs().eq(row_ids.iter().copied()));
|
||||
}
|
||||
|
||||
#[test]
|
||||
|
||||
@@ -31,7 +31,7 @@ pub use u64_based::{
|
||||
serialize_and_load_u64_based_column_values, serialize_u64_based_column_values,
|
||||
};
|
||||
pub use u128_based::{
|
||||
CompactHit, CompactSpaceU64Accessor, open_u128_as_compact_u64, open_u128_mapped,
|
||||
CompactSpaceU64Accessor, open_u128_as_compact_u64, open_u128_mapped,
|
||||
serialize_column_values_u128,
|
||||
};
|
||||
pub use vec_column::VecColumn;
|
||||
@@ -119,18 +119,8 @@ pub trait ColumnValues<T: PartialOrd = u64>: Send + Sync + DowncastSync {
|
||||
/// the segment's `maxdoc`.
|
||||
#[inline(always)]
|
||||
fn get_range(&self, start: u64, output: &mut [T]) {
|
||||
let mut out_chunks = output.chunks_exact_mut(4);
|
||||
let mut idx = start;
|
||||
for out_x4 in out_chunks.by_ref() {
|
||||
out_x4[0] = self.get_val(idx as u32);
|
||||
out_x4[1] = self.get_val((idx + 1) as u32);
|
||||
out_x4[2] = self.get_val((idx + 2) as u32);
|
||||
out_x4[3] = self.get_val((idx + 3) as u32);
|
||||
idx += 4;
|
||||
}
|
||||
for out in out_chunks.into_remainder() {
|
||||
for (out, idx) in output.iter_mut().zip(start..) {
|
||||
*out = self.get_val(idx as u32);
|
||||
idx += 1;
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
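The two `get_range` bodies in this hunk differ only in loop shape: one fills the output four slots at a time and then handles the remainder, the other uses a single zipped loop. A minimal sketch of the unrolled shape, with a closure standing in for `get_val` (the function `get_range_unrolled` and its closure parameter are assumptions made for illustration):

```rust
/// Fills `output` with `get_val(start)`, `get_val(start + 1)`, ...
/// processing four slots per iteration, then the leftover 0..=3 slots.
///
/// `get_val` is a hypothetical stand-in for a column's per-index accessor.
pub fn get_range_unrolled<F: Fn(u32) -> u64>(get_val: F, start: u64, output: &mut [u64]) {
    let mut idx = start;
    let mut chunks = output.chunks_exact_mut(4);
    for chunk in chunks.by_ref() {
        chunk[0] = get_val(idx as u32);
        chunk[1] = get_val((idx + 1) as u32);
        chunk[2] = get_val((idx + 2) as u32);
        chunk[3] = get_val((idx + 3) as u32);
        idx += 4;
    }
    // `into_remainder` yields the slots that did not fill a whole chunk.
    for out in chunks.into_remainder() {
        *out = get_val(idx as u32);
        idx += 1;
    }
}
```

Whether the manual unrolling helps depends on the codec and the compiler; the two shapes produce identical output.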
@@ -1,7 +1,7 @@
|
||||
use std::fmt::Debug;
|
||||
use std::net::Ipv6Addr;
|
||||
|
||||
/// Monotonic maps a value to u128 value space
|
||||
/// Montonic maps a value to u128 value space
|
||||
/// Monotonic mapping enables `PartialOrd` on u128 space without conversion to original space.
|
||||
pub trait MonotonicallyMappableToU128: 'static + PartialOrd + Copy + Debug + Send + Sync {
|
||||
/// Converts a value to u128.
|
||||
|
||||
@@ -185,10 +185,10 @@ impl CompactSpaceBuilder {
|
||||
let mut covered_space = Vec::with_capacity(self.blanks.len());
|
||||
|
||||
// beginning of the blanks
|
||||
if let Some(first_blank_start) = self.blanks.first().map(RangeInclusive::start)
|
||||
&& *first_blank_start != 0
|
||||
{
|
||||
covered_space.push(0..=first_blank_start - 1);
|
||||
if let Some(first_blank_start) = self.blanks.first().map(RangeInclusive::start) {
|
||||
if *first_blank_start != 0 {
|
||||
covered_space.push(0..=first_blank_start - 1);
|
||||
}
|
||||
}
|
||||
|
||||
// Between the blanks
|
||||
@@ -202,10 +202,10 @@ impl CompactSpaceBuilder {
|
||||
covered_space.extend(between_blanks);
|
||||
|
||||
// end of the blanks
|
||||
if let Some(last_blank_end) = self.blanks.last().map(RangeInclusive::end)
|
||||
&& *last_blank_end != u128::MAX
|
||||
{
|
||||
covered_space.push(last_blank_end + 1..=u128::MAX);
|
||||
if let Some(last_blank_end) = self.blanks.last().map(RangeInclusive::end) {
|
||||
if *last_blank_end != u128::MAX {
|
||||
covered_space.push(last_blank_end + 1..=u128::MAX);
|
||||
}
|
||||
}
|
||||
|
||||
if covered_space.is_empty() {
|
||||
|
||||
@@ -292,19 +292,6 @@ impl BinarySerializable for IPCodecParams {
|
||||
}
|
||||
}
|
||||
|
||||
/// Represents the result of looking up a u128 value in the compact space.
|
||||
///
|
||||
/// If a value is outside the compact space, the next compact value is returned.
|
||||
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
|
||||
pub enum CompactHit {
|
||||
/// The value exists in the compact space
|
||||
Exact(u32),
|
||||
/// The value does not exist in the compact space, but the next higher value does
|
||||
Next(u32),
|
||||
/// The value is greater than the maximum compact value
|
||||
AfterLast,
|
||||
}
|
||||
|
||||
/// Exposes the compact space compressed values as u64.
|
||||
///
|
||||
/// This allows faster access to the values, as u64 is faster to work with than u128.
|
||||
@@ -322,11 +309,6 @@ impl CompactSpaceU64Accessor {
|
||||
pub fn compact_to_u128(&self, compact: u32) -> u128 {
|
||||
self.0.compact_to_u128(compact)
|
||||
}
|
||||
|
||||
/// Finds the next compact space value for a given u128 value.
|
||||
pub fn u128_to_next_compact(&self, value: u128) -> CompactHit {
|
||||
self.0.u128_to_next_compact(value)
|
||||
}
|
||||
}
|
||||
|
||||
impl ColumnValues<u64> for CompactSpaceU64Accessor {
|
||||
@@ -459,21 +441,6 @@ impl CompactSpaceDecompressor {
|
||||
self.params.compact_space.u128_to_compact(value)
|
||||
}
|
||||
|
||||
/// Finds the next compact space value for a given u128 value.
|
||||
pub fn u128_to_next_compact(&self, value: u128) -> CompactHit {
|
||||
match self.u128_to_compact(value) {
|
||||
Ok(compact) => CompactHit::Exact(compact),
|
||||
Err(pos) => {
|
||||
if pos >= self.params.compact_space.ranges_mapping.len() {
|
||||
CompactHit::AfterLast
|
||||
} else {
|
||||
let next_range = &self.params.compact_space.ranges_mapping[pos];
|
||||
CompactHit::Next(next_range.compact_start)
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
fn compact_to_u128(&self, compact: u32) -> u128 {
|
||||
self.params.compact_space.compact_to_u128(compact)
|
||||
}
|
||||
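The `u128_to_next_compact` lookup in this hunk distinguishes three outcomes: the value is mapped, the value falls in a gap before some mapped range, or the value lies past the last range. A simplified sketch of that classification over sorted, non-overlapping ranges (`Hit`, `RangeMapping`, and `to_next_compact` are hypothetical types and names, not the tantivy ones):

```rust
use std::ops::RangeInclusive;

/// Hypothetical three-way lookup result, mirroring the shape of `CompactHit`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Hit {
    Exact(u32),
    Next(u32),
    AfterLast,
}

/// Hypothetical mapping of one covered u128 range to a compact start value.
pub struct RangeMapping {
    pub values: RangeInclusive<u128>,
    pub compact_start: u32,
}

/// Classifies `value` against `ranges`, which must be sorted and non-overlapping.
pub fn to_next_compact(ranges: &[RangeMapping], value: u128) -> Hit {
    // Binary search for the first range that ends at or after `value`.
    let pos = ranges.partition_point(|range| *range.values.end() < value);
    match ranges.get(pos) {
        None => Hit::AfterLast,
        Some(range) if range.values.contains(&value) => {
            Hit::Exact(range.compact_start + (value - *range.values.start()) as u32)
        }
        // `value` sits in the gap before this range.
        Some(range) => Hit::Next(range.compact_start),
    }
}
```

Returning the next compact value lets a caller turn a range bound that falls in a gap into a usable bound in compact space.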
@@ -856,41 +823,6 @@ mod tests {
|
||||
let _data = test_aux_vals(vals);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_u128_to_next_compact() {
|
||||
let vals = &[100u128, 200u128, 1_000_000_000u128, 1_000_000_100u128];
|
||||
let mut data = test_aux_vals(vals);
|
||||
|
||||
let _header = U128Header::deserialize(&mut data);
|
||||
let decomp = CompactSpaceDecompressor::open(data).unwrap();
|
||||
|
||||
// Test value that's already in a range
|
||||
let compact_100 = decomp.u128_to_compact(100).unwrap();
|
||||
assert_eq!(
|
||||
decomp.u128_to_next_compact(100),
|
||||
CompactHit::Exact(compact_100)
|
||||
);
|
||||
|
||||
// Test value between two ranges
|
||||
let compact_million = decomp.u128_to_compact(1_000_000_000).unwrap();
|
||||
assert_eq!(
|
||||
decomp.u128_to_next_compact(250),
|
||||
CompactHit::Next(compact_million)
|
||||
);
|
||||
|
||||
// Test value before the first range
|
||||
assert_eq!(
|
||||
decomp.u128_to_next_compact(50),
|
||||
CompactHit::Next(compact_100)
|
||||
);
|
||||
|
||||
// Test value after the last range
|
||||
assert_eq!(
|
||||
decomp.u128_to_next_compact(10_000_000_000),
|
||||
CompactHit::AfterLast
|
||||
);
|
||||
}
|
||||
|
||||
use proptest::prelude::*;
|
||||
|
||||
fn num_strategy() -> impl Strategy<Value = u128> {
|
||||
|
||||
@@ -7,7 +7,7 @@ mod compact_space;
|
||||
|
||||
use common::{BinarySerializable, OwnedBytes, VInt};
|
||||
pub use compact_space::{
|
||||
CompactHit, CompactSpaceCompressor, CompactSpaceDecompressor, CompactSpaceU64Accessor,
|
||||
CompactSpaceCompressor, CompactSpaceDecompressor, CompactSpaceU64Accessor,
|
||||
};
|
||||
|
||||
use crate::column_values::monotonic_map_column;
|
||||
|
||||
@@ -41,6 +41,12 @@ fn transform_range_before_linear_transformation(
     if range.is_empty() {
         return None;
     }
+    if stats.min_value > *range.end() {
+        return None;
+    }
+    if stats.max_value < *range.start() {
+        return None;
+    }
     let shifted_range =
         range.start().saturating_sub(stats.min_value)..=range.end().saturating_sub(stats.min_value);
     let start_before_gcd_multiplication: u64 = div_ceil(*shifted_range.start(), stats.gcd);
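The two early returns on `stats.min_value` and `stats.max_value` in the hunk above matter because of `saturating_sub`: when the query range lies entirely below the column minimum, both shifted bounds collapse to 0 and the range would appear to match the minimum value. The sketch below illustrates this, assuming each value is stored as `(value - min_value) / gcd`. The `Stats` struct and the `to_stored_range` function are hypothetical names for this illustration, and the upper-bound handling (`end / gcd`) is an assumption, since the hunk only shows the lower bound.

```rust
use std::ops::RangeInclusive;

/// Hypothetical column statistics; field names mirror the hunk above.
struct Stats {
    min_value: u64,
    max_value: u64,
    gcd: u64, // assumed to be non-zero
}

/// Translates a query range on original values into a range on stored values,
/// assuming each value is stored as `(value - min_value) / gcd`.
fn to_stored_range(range: RangeInclusive<u64>, stats: &Stats) -> Option<RangeInclusive<u64>> {
    if range.is_empty() {
        return None;
    }
    // The query lies entirely below or entirely above the column's values.
    // Without this check, `saturating_sub` would clamp both bounds to 0.
    if stats.min_value > *range.end() || stats.max_value < *range.start() {
        return None;
    }
    let start = range.start().saturating_sub(stats.min_value);
    let end = range.end().saturating_sub(stats.min_value);
    // Round the lower bound up and the upper bound down to multiples of `gcd`.
    let stored = start.div_ceil(stats.gcd)..=end / stats.gcd;
    if stored.is_empty() {
        None
    } else {
        Some(stored)
    }
}
```

With `min_value = 100`, `max_value = 200` and `gcd = 10`, the query `115..=145` maps to the stored range `2..=4` (the values 120, 130 and 140), while `0..=50` is rejected instead of being clamped to `0..=0`.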
@@ -99,7 +105,7 @@ impl ColumnCodecEstimator for BitpackedCodecEstimator {

     fn estimate(&self, stats: &ColumnStats) -> Option<u64> {
         let num_bits_per_value = num_bits(stats);
-        Some(stats.num_bytes() + (stats.num_rows as u64 * (num_bits_per_value as u64)).div_ceil(8))
+        Some(stats.num_bytes() + (stats.num_rows as u64 * (num_bits_per_value as u64) + 7) / 8)
     }

     fn serialize(
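Both spellings of the estimate above round a bit count up to whole bytes. A minimal sketch of the two forms follows; the function names are hypothetical and not part of the crate. `u64::div_ceil` has been stable since Rust 1.73, and unlike the `+ 7` idiom it cannot overflow when the bit count is within 7 of `u64::MAX`.

```rust
/// Bytes needed to store `num_bits` bits, written with the manual `+ 7` idiom.
fn bytes_for_bits_manual(num_bits: u64) -> u64 {
    (num_bits + 7) / 8
}

/// Same quantity using `u64::div_ceil`.
fn bytes_for_bits_div_ceil(num_bits: u64) -> u64 {
    num_bits.div_ceil(8)
}

/// Hypothetical payload estimate: `num_rows` values of `bits_per_value` bits each.
fn bitpacked_payload_bytes(num_rows: u32, bits_per_value: u8) -> u64 {
    (num_rows as u64 * bits_per_value as u64).div_ceil(8)
}
```

For example, 10 rows at 3 bits each occupy 30 bits, which rounds up to 4 bytes.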
@@ -8,7 +8,7 @@ use crate::column_values::ColumnValues;
 const MID_POINT: u64 = (1u64 << 32) - 1u64;

 /// `Line` describes a line function `y: ax + b` using integer
-/// arithmetic.
+/// arithmetics.
 ///
 /// The slope is in fact a decimal split into a 32 bit integer value,
 /// and a 32-bit decimal value.
@@ -94,7 +94,7 @@ impl Line {
         // `(i, ys[])`.
         //
         // The best intercept therefore has the form
-        // `y[i] - line.eval(i)` (using wrapping arithmetic).
+        // `y[i] - line.eval(i)` (using wrapping arithmetics).
         // In other words, the best intercept is one of the `y - Line::eval(ys[i])`
         // and our task is just to pick the one that minimizes our error.
         //
@@ -117,7 +117,7 @@ impl ColumnCodecEstimator for LinearCodecEstimator {
         Some(
             stats.num_bytes()
                 + linear_params.num_bytes()
-                + (num_bits as u64 * stats.num_rows as u64).div_ceil(8),
+                + (num_bits as u64 * stats.num_rows as u64 + 7) / 8,
         )
     }

@@ -268,7 +268,7 @@ mod tests {

     #[test]
     fn linear_interpol_fast_field_rand() {
-        let mut rng = rand::rng();
+        let mut rng = rand::thread_rng();
         for _ in 0..50 {
             let mut data = (0..10_000).map(|_| rng.next_u64()).collect::<Vec<_>>();
             create_and_validate::<LinearCodec>(&data, "random");
@@ -52,7 +52,7 @@ pub trait ColumnCodecEstimator<T = u64>: 'static {
     ) -> io::Result<()>;
 }

-/// A column codec describes a column serialization format.
+/// A column codec describes a colunm serialization format.
 pub trait ColumnCodec<T: PartialOrd = u64> {
     /// Specialized `ColumnValues` type.
     type ColumnValues: ColumnValues<T> + 'static;
@@ -121,24 +121,8 @@ pub(crate) fn create_and_validate<TColumnCodec: ColumnCodec>(
|
||||
reader.get_vals(&all_docs, &mut buffer);
|
||||
assert_eq!(vals, buffer);
|
||||
|
||||
// Validate `get_range` over the full column and a sub-range. The sub-range starts
|
||||
// at a non-zero offset to exercise the entrance-ramp alignment of the batch decode.
|
||||
buffer.resize(all_docs.len(), 0);
|
||||
reader.get_range(0, &mut buffer);
|
||||
assert_eq!(vals, buffer, "get_range (full) mismatch in data set {name}");
|
||||
if vals.len() >= 2 {
|
||||
let start = 1usize;
|
||||
buffer.resize(vals.len() - start, 0);
|
||||
reader.get_range(start as u64, &mut buffer);
|
||||
assert_eq!(
|
||||
&vals[start..],
|
||||
&buffer[..],
|
||||
"get_range (sub-range) mismatch in data set {name}"
|
||||
);
|
||||
}
|
||||
|
||||
if !vals.is_empty() {
|
||||
let test_rand_idx = rand::rng().random_range(0..=vals.len() - 1);
|
||||
let test_rand_idx = rand::thread_rng().gen_range(0..=vals.len() - 1);
|
||||
let expected_positions: Vec<u32> = vals
|
||||
.iter()
|
||||
.enumerate()
|
||||
|
||||
@@ -33,25 +33,6 @@ pub fn merge_bytes_or_str_column(
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Computes a per-segment mapping from old term ordinal to merged term ordinal.
|
||||
///
|
||||
/// Performs a streaming k-way merge of per-segment term dictionaries (SSTable-backed) to build
|
||||
/// a unified ordering. For each segment, the output is a `Vec<TermOrdinal>` where index `i`
|
||||
/// holds the merged global ordinal corresponding to segment-local ordinal `i`.
|
||||
///
|
||||
/// This is used by index sorting to compare terms from different segments without materializing
|
||||
/// term bytes in memory — only ordinals are compared.
|
||||
#[doc(hidden)]
|
||||
pub fn compute_merged_term_ord_mapping(
|
||||
bytes_columns: &[BytesColumn],
|
||||
) -> io::Result<Vec<Vec<TermOrdinal>>> {
|
||||
let bytes_columns_opt: Vec<Option<BytesColumn>> =
|
||||
bytes_columns.iter().cloned().map(Some).collect();
|
||||
let term_ord_mapping =
|
||||
merge_dict_and_compute_term_ord_mapping(&bytes_columns_opt, |_| true, |_| Ok(()))?;
|
||||
Ok(term_ord_mapping.into_per_segment_new_term_ordinals())
|
||||
}
|
||||
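The doc comment on `compute_merged_term_ord_mapping` describes a k-way merge that yields, for each segment, a vector mapping local term ordinals to merged ordinals. The sketch below shows that idea on plain in-memory slices using a binary heap. It is a simplified illustration, not the crate's implementation: the real code streams SSTable-backed dictionaries through `TermMerger`, and the function name `merged_term_ords` is hypothetical. Each input list is assumed to be sorted and free of duplicates.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// For each segment's sorted, deduplicated term list, returns the merged
/// ordinal of every segment-local term ordinal.
fn merged_term_ords(segments: &[Vec<&[u8]>]) -> Vec<Vec<u64>> {
    let mut mapping: Vec<Vec<u64>> = segments
        .iter()
        .map(|terms| vec![0u64; terms.len()])
        .collect();
    // Min-heap of (term, segment index, local ordinal).
    let mut heap = BinaryHeap::new();
    for (seg, terms) in segments.iter().enumerate() {
        if let Some(first) = terms.first() {
            heap.push(Reverse((*first, seg, 0usize)));
        }
    }
    let mut merged_ord: u64 = 0;
    let mut previous: Option<&[u8]> = None;
    while let Some(Reverse((term, seg, local_ord))) = heap.pop() {
        // A term different from the previous one gets the next merged ordinal;
        // equal terms from different segments share the same ordinal.
        if let Some(prev) = previous {
            if prev != term {
                merged_ord += 1;
            }
        }
        previous = Some(term);
        mapping[seg][local_ord] = merged_ord;
        // Advance the segment this term came from.
        if let Some(next) = segments[seg].get(local_ord + 1) {
            heap.push(Reverse((*next, seg, local_ord + 1)));
        }
    }
    mapping
}
```

Once the mapping exists, terms from different segments can be compared by their merged ordinals alone, which is the property the doc comment relies on for index sorting.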
|
||||
struct RemappedTermOrdinalsValues<'a> {
|
||||
bytes_columns: &'a [Option<BytesColumn>],
|
||||
term_ord_mapping: &'a TermOrdinalMapping,
|
||||
@@ -137,14 +118,14 @@ fn is_term_present(bitsets: &[Option<BitSet>], term_merger: &TermMerger) -> bool
|
||||
false
|
||||
}
|
||||
|
||||
fn merge_dict_and_compute_term_ord_mapping(
|
||||
fn serialize_merged_dict(
|
||||
bytes_columns: &[Option<BytesColumn>],
|
||||
mut should_keep_term: impl FnMut(&TermMerger) -> bool,
|
||||
mut emit_term: impl FnMut(&[u8]) -> io::Result<()>,
|
||||
merge_row_order: &MergeRowOrder,
|
||||
output: &mut impl Write,
|
||||
) -> io::Result<TermOrdinalMapping> {
|
||||
let mut term_ord_mapping = TermOrdinalMapping::default();
|
||||
|
||||
let mut field_term_streams = Vec::with_capacity(bytes_columns.len());
|
||||
let mut field_term_streams = Vec::new();
|
||||
for (segment_ord, column_opt) in bytes_columns.iter().enumerate() {
|
||||
if let Some(column) = column_opt {
|
||||
term_ord_mapping.add_segment(column.dictionary.num_terms());
|
||||
@@ -160,33 +141,21 @@ fn merge_dict_and_compute_term_ord_mapping(
|
||||
}
|
||||
|
||||
let mut merged_terms = TermMerger::new(field_term_streams);
|
||||
let mut current_term_ord = 0;
|
||||
while merged_terms.advance() {
|
||||
if !should_keep_term(&merged_terms) {
|
||||
continue;
|
||||
}
|
||||
emit_term(merged_terms.key())?;
|
||||
for (segment_ord, from_term_ord) in merged_terms.matching_segments() {
|
||||
term_ord_mapping.register_from_to(segment_ord, from_term_ord, current_term_ord);
|
||||
}
|
||||
current_term_ord += 1;
|
||||
}
|
||||
|
||||
Ok(term_ord_mapping)
|
||||
}
|
||||
|
||||
fn serialize_merged_dict(
|
||||
bytes_columns: &[Option<BytesColumn>],
|
||||
merge_row_order: &MergeRowOrder,
|
||||
output: &mut impl Write,
|
||||
) -> io::Result<TermOrdinalMapping> {
|
||||
let mut sstable_builder = sstable::VoidSSTable::writer(output);
|
||||
let term_ord_mapping = match merge_row_order {
|
||||
MergeRowOrder::Stack(_) => merge_dict_and_compute_term_ord_mapping(
|
||||
bytes_columns,
|
||||
|_| true,
|
||||
|term_bytes| sstable_builder.insert(term_bytes, &()),
|
||||
)?,
|
||||
|
||||
match merge_row_order {
|
||||
MergeRowOrder::Stack(_) => {
|
||||
let mut current_term_ord = 0;
|
||||
while merged_terms.advance() {
|
||||
let term_bytes: &[u8] = merged_terms.key();
|
||||
sstable_builder.insert(term_bytes, &())?;
|
||||
for (segment_ord, from_term_ord) in merged_terms.matching_segments() {
|
||||
term_ord_mapping.register_from_to(segment_ord, from_term_ord, current_term_ord);
|
||||
}
|
||||
current_term_ord += 1;
|
||||
}
|
||||
sstable_builder.finish()?;
|
||||
}
|
||||
MergeRowOrder::Shuffled(shuffle_merge_order) => {
|
||||
assert_eq!(shuffle_merge_order.alive_bitsets.len(), bytes_columns.len());
|
||||
let mut term_bitsets: Vec<Option<BitSet>> = Vec::with_capacity(bytes_columns.len());
|
||||
@@ -205,14 +174,21 @@ fn serialize_merged_dict(
|
||||
}
|
||||
}
|
||||
}
|
||||
merge_dict_and_compute_term_ord_mapping(
|
||||
bytes_columns,
|
||||
|merged_terms| is_term_present(&term_bitsets[..], merged_terms),
|
||||
|term_bytes| sstable_builder.insert(term_bytes, &()),
|
||||
)?
|
||||
let mut current_term_ord = 0;
|
||||
while merged_terms.advance() {
|
||||
let term_bytes: &[u8] = merged_terms.key();
|
||||
if !is_term_present(&term_bitsets[..], &merged_terms) {
|
||||
continue;
|
||||
}
|
||||
sstable_builder.insert(term_bytes, &())?;
|
||||
for (segment_ord, from_term_ord) in merged_terms.matching_segments() {
|
||||
term_ord_mapping.register_from_to(segment_ord, from_term_ord, current_term_ord);
|
||||
}
|
||||
current_term_ord += 1;
|
||||
}
|
||||
sstable_builder.finish()?;
|
||||
}
|
||||
};
|
||||
sstable_builder.finish()?;
|
||||
}
|
||||
Ok(term_ord_mapping)
|
||||
}
|
||||
|
||||
@@ -235,8 +211,4 @@ impl TermOrdinalMapping {
|
||||
fn get_segment(&self, segment_ord: u32) -> &[TermOrdinal] {
|
||||
&self.per_segment_new_term_ordinals[segment_ord as usize]
|
||||
}
|
||||
|
||||
fn into_per_segment_new_term_ordinals(self) -> Vec<Vec<TermOrdinal>> {
|
||||
self.per_segment_new_term_ordinals
|
||||
}
|
||||
}
|
||||
|
||||
@@ -7,7 +7,6 @@ use std::io;
 use std::net::Ipv6Addr;
 use std::sync::Arc;

-pub use merge_dict_column::compute_merged_term_ord_mapping;
 pub use merge_mapping::{MergeRowOrder, ShuffleMergeOrder, StackMergeOrder};

 use super::writer::ColumnarSerializer;
@@ -368,7 +367,7 @@ fn is_empty_after_merge(
         ColumnIndex::Empty { .. } => true,
         ColumnIndex::Full => alive_bitset.len() == 0,
         ColumnIndex::Optional(optional_index) => {
-            for doc in optional_index.iter_non_null_docs() {
+            for doc in optional_index.iter_docs() {
                 if alive_bitset.contains(doc) {
                     return false;
                 }
@@ -17,7 +17,7 @@ fn make_columnar<T: Into<NumericalValue> + HasAssociatedColumnType + Copy>(
|
||||
}
|
||||
let mut buffer: Vec<u8> = Vec::new();
|
||||
dataframe_writer
|
||||
.serialize(vals.len() as RowId, None, &mut buffer)
|
||||
.serialize(vals.len() as RowId, &mut buffer)
|
||||
.unwrap();
|
||||
ColumnarReader::open(buffer).unwrap()
|
||||
}
|
||||
@@ -143,9 +143,7 @@ fn make_numerical_columnar_multiple_columns(
|
||||
.max()
|
||||
.unwrap_or(0u32);
|
||||
let mut buffer: Vec<u8> = Vec::new();
|
||||
dataframe_writer
|
||||
.serialize(num_rows, None, &mut buffer)
|
||||
.unwrap();
|
||||
dataframe_writer.serialize(num_rows, &mut buffer).unwrap();
|
||||
ColumnarReader::open(buffer).unwrap()
|
||||
}
|
||||
|
||||
@@ -168,9 +166,7 @@ fn make_byte_columnar_multiple_columns(
|
||||
}
|
||||
}
|
||||
let mut buffer: Vec<u8> = Vec::new();
|
||||
dataframe_writer
|
||||
.serialize(num_rows, None, &mut buffer)
|
||||
.unwrap();
|
||||
dataframe_writer.serialize(num_rows, &mut buffer).unwrap();
|
||||
ColumnarReader::open(buffer).unwrap()
|
||||
}
|
||||
|
||||
@@ -189,9 +185,7 @@ fn make_text_columnar_multiple_columns(columns: &[(&str, &[&[&str]])]) -> Column
|
||||
.max()
|
||||
.unwrap_or(0u32);
|
||||
let mut buffer: Vec<u8> = Vec::new();
|
||||
dataframe_writer
|
||||
.serialize(num_rows, None, &mut buffer)
|
||||
.unwrap();
|
||||
dataframe_writer.serialize(num_rows, &mut buffer).unwrap();
|
||||
ColumnarReader::open(buffer).unwrap()
|
||||
}
|
||||
|
||||
@@ -550,7 +544,7 @@ fn build_columnar(spec: &ColumnarSpec) -> ColumnarReader {
|
||||
}
|
||||
|
||||
let mut buffer = Vec::new();
|
||||
writer.serialize(max_row_id + 1, None, &mut buffer).unwrap();
|
||||
writer.serialize(max_row_id + 1, &mut buffer).unwrap();
|
||||
ColumnarReader::open(buffer).unwrap()
|
||||
}
|
||||
|
||||
|
||||
@@ -8,9 +8,6 @@ pub use column_type::{ColumnType, HasAssociatedColumnType};
|
||||
pub use format_version::{CURRENT_VERSION, Version};
|
||||
#[cfg(test)]
|
||||
pub(crate) use merge::ColumnTypeCategory;
|
||||
pub use merge::{
|
||||
MergeRowOrder, ShuffleMergeOrder, StackMergeOrder, compute_merged_term_ord_mapping,
|
||||
merge_columnar,
|
||||
};
|
||||
pub use merge::{MergeRowOrder, ShuffleMergeOrder, StackMergeOrder, merge_columnar};
|
||||
pub use reader::ColumnarReader;
|
||||
pub use writer::ColumnarWriter;
|
||||
|
||||
@@ -226,7 +226,7 @@ mod tests {
|
||||
columnar_writer.record_column_type("col1", ColumnType::Str, false);
|
||||
columnar_writer.record_column_type("col2", ColumnType::U64, false);
|
||||
let mut buffer = Vec::new();
|
||||
columnar_writer.serialize(1, None, &mut buffer).unwrap();
|
||||
columnar_writer.serialize(1, &mut buffer).unwrap();
|
||||
let columnar = ColumnarReader::open(buffer).unwrap();
|
||||
let columns = columnar.list_columns().unwrap();
|
||||
assert_eq!(columns.len(), 2);
|
||||
@@ -242,7 +242,7 @@ mod tests {
|
||||
columnar_writer.record_column_type("count", ColumnType::U64, false);
|
||||
columnar_writer.record_numerical(1, "count", 1u64);
|
||||
let mut buffer = Vec::new();
|
||||
columnar_writer.serialize(2, None, &mut buffer).unwrap();
|
||||
columnar_writer.serialize(2, &mut buffer).unwrap();
|
||||
let columnar = ColumnarReader::open(buffer).unwrap();
|
||||
let columns = columnar.list_columns().unwrap();
|
||||
assert_eq!(columns.len(), 1);
|
||||
@@ -256,7 +256,7 @@ mod tests {
|
||||
columnar_writer.record_column_type("col", ColumnType::U64, false);
|
||||
columnar_writer.record_numerical(1, "col", 1u64);
|
||||
let mut buffer = Vec::new();
|
||||
columnar_writer.serialize(2, None, &mut buffer).unwrap();
|
||||
columnar_writer.serialize(2, &mut buffer).unwrap();
|
||||
let columnar = ColumnarReader::open(buffer).unwrap();
|
||||
{
|
||||
let columns = columnar.read_columns("col").unwrap();
|
||||
@@ -285,7 +285,7 @@ mod tests {
|
||||
columnar_writer.record_str(1, "col1", "hello");
|
||||
columnar_writer.record_str(0, "col2", "hello");
|
||||
let mut buffer = Vec::new();
|
||||
columnar_writer.serialize(2, None, &mut buffer).unwrap();
|
||||
columnar_writer.serialize(2, &mut buffer).unwrap();
|
||||
|
||||
let columnar = ColumnarReader::open(buffer).unwrap();
|
||||
{
|
||||
|
||||
@@ -244,7 +244,7 @@ impl SymbolValue for UnorderedId {

 fn compute_num_bytes_for_u64(val: u64) -> usize {
     let msb = (64u32 - val.leading_zeros()) as usize;
-    msb.div_ceil(8)
+    (msb + 7) / 8
 }

 fn encode_zig_zag(n: i64) -> u64 {
@@ -41,31 +41,10 @@ impl ColumnWriter {
|
||||
pub(super) fn operation_iterator<'a, V: SymbolValue>(
|
||||
&self,
|
||||
arena: &MemoryArena,
|
||||
old_to_new_ids_opt: Option<&[RowId]>,
|
||||
buffer: &'a mut Vec<u8>,
|
||||
) -> impl Iterator<Item = ColumnOperation<V>> + 'a + use<'a, V> {
|
||||
buffer.clear();
|
||||
self.values.read_to_end(arena, buffer);
|
||||
if let Some(old_to_new_ids) = old_to_new_ids_opt {
|
||||
// TODO avoid the extra deserialization / serialization.
|
||||
let mut sorted_ops: Vec<(RowId, ColumnOperation<V>)> = Vec::new();
|
||||
let mut new_doc = 0u32;
|
||||
let mut cursor = &buffer[..];
|
||||
for op in std::iter::from_fn(|| ColumnOperation::<V>::deserialize(&mut cursor)) {
|
||||
if let ColumnOperation::NewDoc(doc) = &op {
|
||||
new_doc = old_to_new_ids[*doc as usize];
|
||||
sorted_ops.push((new_doc, ColumnOperation::NewDoc(new_doc)));
|
||||
} else {
|
||||
sorted_ops.push((new_doc, op));
|
||||
}
|
||||
}
|
||||
// stable sort is crucial here.
|
||||
sorted_ops.sort_by_key(|(new_doc_id, _)| *new_doc_id);
|
||||
buffer.clear();
|
||||
for (_, op) in sorted_ops {
|
||||
buffer.extend_from_slice(op.serialize().as_ref());
|
||||
}
|
||||
}
|
||||
let mut cursor: &[u8] = &buffer[..];
|
||||
std::iter::from_fn(move || ColumnOperation::deserialize(&mut cursor))
|
||||
}
|
||||
@@ -232,11 +211,9 @@ impl NumericalColumnWriter {
|
||||
pub(super) fn operation_iterator<'a>(
|
||||
self,
|
||||
arena: &MemoryArena,
|
||||
old_to_new_ids: Option<&[RowId]>,
|
||||
buffer: &'a mut Vec<u8>,
|
||||
) -> impl Iterator<Item = ColumnOperation<NumericalValue>> + 'a + use<'a> {
|
||||
self.column_writer
|
||||
.operation_iterator(arena, old_to_new_ids, buffer)
|
||||
self.column_writer.operation_iterator(arena, buffer)
|
||||
}
|
||||
}
|
||||
|
||||
@@ -278,11 +255,9 @@ impl StrOrBytesColumnWriter {
|
||||
pub(super) fn operation_iterator<'a>(
|
||||
&self,
|
||||
arena: &MemoryArena,
|
||||
old_to_new_ids: Option<&[RowId]>,
|
||||
byte_buffer: &'a mut Vec<u8>,
|
||||
) -> impl Iterator<Item = ColumnOperation<UnorderedId>> + 'a + use<'a> {
|
||||
self.column_writer
|
||||
.operation_iterator(arena, old_to_new_ids, byte_buffer)
|
||||
self.column_writer.operation_iterator(arena, byte_buffer)
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
@@ -44,7 +44,7 @@ struct SpareBuffers {
|
||||
/// columnar_writer.record_str(1u32 /* doc id */, "product_name", "Apple");
|
||||
/// columnar_writer.record_numerical(0u32 /* doc id */, "price", 10.5f64); //< uh oh we ended up mixing integer and floats.
|
||||
/// let mut wrt: Vec<u8> = Vec::new();
|
||||
/// columnar_writer.serialize(2u32, None, &mut wrt).unwrap();
|
||||
/// columnar_writer.serialize(2u32, &mut wrt).unwrap();
|
||||
/// ```
|
||||
#[derive(Default)]
|
||||
pub struct ColumnarWriter {
|
||||
@@ -76,75 +76,6 @@ impl ColumnarWriter {
|
||||
.sum::<usize>()
|
||||
}
|
||||
|
||||
/// Returns the list of doc ids from 0..num_docs sorted by the `sort_field`
|
||||
/// column.
|
||||
///
|
||||
/// If the column is multivalued, use the first value for scoring.
|
||||
/// If no value is associated to a specific row, the document is assigned
|
||||
/// the lowest possible score.
|
||||
///
|
||||
/// The sort applied is stable.
|
||||
pub fn sort_order(&self, sort_field: &str, num_docs: RowId, reversed: bool) -> Vec<u32> {
|
||||
let Some(numerical_col_writer) = self
|
||||
.numerical_field_hash_map
|
||||
.get::<NumericalColumnWriter>(sort_field.as_bytes())
|
||||
.or_else(|| {
|
||||
self.datetime_field_hash_map
|
||||
.get::<NumericalColumnWriter>(sort_field.as_bytes())
|
||||
})
|
||||
else {
|
||||
let str_or_bytes_column_opt = self
|
||||
.str_field_hash_map
|
||||
.get::<StrOrBytesColumnWriter>(sort_field.as_bytes())
|
||||
.or_else(|| {
|
||||
self.bytes_field_hash_map
|
||||
.get::<StrOrBytesColumnWriter>(sort_field.as_bytes())
|
||||
});
|
||||
let Some(str_or_bytes_column) = str_or_bytes_column_opt else {
|
||||
return Vec::new();
|
||||
};
|
||||
|
||||
let dictionary_builder = &self.dictionaries[str_or_bytes_column.dictionary_id as usize];
|
||||
let term_id_mapping = dictionary_builder.build_term_id_mapping(&self.arena);
|
||||
let mut symbols_buffer = Vec::new();
|
||||
|
||||
return collect_sort_order_from_ops(
|
||||
str_or_bytes_column.operation_iterator(&self.arena, None, &mut symbols_buffer),
|
||||
num_docs,
|
||||
reversed,
|
||||
|uid| Some(term_id_mapping.to_ord(uid).0),
|
||||
None,
|
||||
|a, b| a.cmp(b),
|
||||
);
|
||||
};
|
||||
let mut symbols_buffer = Vec::new();
|
||||
collect_sort_order_from_ops(
|
||||
numerical_col_writer.operation_iterator(&self.arena, None, &mut symbols_buffer),
|
||||
num_docs,
|
||||
reversed,
|
||||
// MonotonicallyMappableToU64 converts each value to u64 in an
|
||||
// order-preserving way (u64: identity, i64: XOR sign bit, f64: bit
|
||||
// manipulation). Converting once per document lets the comparator be
|
||||
// a simple u64 cmp instead of unwrapping the NumericalValue variant
|
||||
// on every comparison.
|
||||
//
|
||||
// For f64, NaN maps to a deterministic u64 via raw bit manipulation,
|
||||
// so it sorts to a consistent position. Sorting only requires total
|
||||
// ordering, not IEEE 754 equality semantics where NaN != NaN.
|
||||
|nv| {
|
||||
Some(match nv {
|
||||
NumericalValue::U64(v) => v.to_u64(),
|
||||
NumericalValue::I64(v) => v.to_u64(),
|
||||
NumericalValue::F64(v) => v.to_u64(),
|
||||
})
|
||||
},
|
||||
// None for missing values. Option<u64> sorts None < Some(_),
|
||||
// placing nulls before non-null values.
|
||||
None,
|
||||
|a, b| a.cmp(b),
|
||||
)
|
||||
}
|
||||
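The comment inside `sort_order` mentions converting each numerical value to a `u64` in an order-preserving way (identity for `u64`, XOR of the sign bit for `i64`, bit manipulation for `f64`). The sketch below shows one common way to build such mappings. It is an assumption about the general technique and may differ in detail from the crate's `MonotonicallyMappableToU64` implementations; both function names are hypothetical.

```rust
/// Maps an `i64` to a `u64` so that unsigned order matches signed order.
/// Flipping the sign bit moves negative values below non-negative ones.
fn i64_to_sortable_u64(val: i64) -> u64 {
    (val as u64) ^ (1u64 << 63)
}

/// Maps an `f64` to a `u64` whose unsigned order matches the numeric order
/// of non-NaN floats. NaN values land at a fixed position decided by their bits.
fn f64_to_sortable_u64(val: f64) -> u64 {
    let bits = val.to_bits();
    if bits & (1u64 << 63) == 0 {
        // Non-negative: set the sign bit so it sorts above every negative value.
        bits | (1u64 << 63)
    } else {
        // Negative: flip every bit so that larger magnitudes sort lower.
        !bits
    }
}
```

Converting once per document lets the sort compare plain `u64` keys, which is the motivation given in the source comment.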
|
||||
/// Records a column type. This is useful to bypass the coercion process,
|
||||
/// makes sure the empty is present in the resulting columnar, or set
|
||||
/// the `sort_values_within_row`.
|
||||
@@ -315,12 +246,7 @@ impl ColumnarWriter {
|
||||
},
|
||||
);
|
||||
}
|
||||
pub fn serialize(
|
||||
&mut self,
|
||||
num_docs: RowId,
|
||||
old_to_new_row_ids: Option<&[RowId]>,
|
||||
wrt: &mut dyn io::Write,
|
||||
) -> io::Result<()> {
|
||||
pub fn serialize(&mut self, num_docs: RowId, wrt: &mut dyn io::Write) -> io::Result<()> {
|
||||
let mut serializer = ColumnarSerializer::new(wrt);
|
||||
|
||||
let mut columns: Vec<(&[u8], ColumnType, Addr)> = self
|
||||
@@ -377,11 +303,7 @@ impl ColumnarWriter {
|
||||
serialize_bool_column(
|
||||
cardinality,
|
||||
num_docs,
|
||||
column_writer.operation_iterator(
|
||||
arena,
|
||||
old_to_new_row_ids,
|
||||
&mut symbol_byte_buffer,
|
||||
),
|
||||
column_writer.operation_iterator(arena, &mut symbol_byte_buffer),
|
||||
buffers,
|
||||
&mut column_serializer,
|
||||
)?;
|
||||
@@ -395,11 +317,7 @@ impl ColumnarWriter {
|
||||
serialize_ip_addr_column(
|
||||
cardinality,
|
||||
num_docs,
|
||||
column_writer.operation_iterator(
|
||||
arena,
|
||||
old_to_new_row_ids,
|
||||
&mut symbol_byte_buffer,
|
||||
),
|
||||
column_writer.operation_iterator(arena, &mut symbol_byte_buffer),
|
||||
buffers,
|
||||
&mut column_serializer,
|
||||
)?;
|
||||
@@ -424,11 +342,8 @@ impl ColumnarWriter {
|
||||
num_docs,
|
||||
str_or_bytes_column_writer.sort_values_within_row,
|
||||
dictionary_builder,
|
||||
str_or_bytes_column_writer.operation_iterator(
|
||||
arena,
|
||||
old_to_new_row_ids,
|
||||
&mut symbol_byte_buffer,
|
||||
),
|
||||
str_or_bytes_column_writer
|
||||
.operation_iterator(arena, &mut symbol_byte_buffer),
|
||||
buffers,
|
||||
&self.arena,
|
||||
&mut column_serializer,
|
||||
@@ -446,11 +361,7 @@ impl ColumnarWriter {
|
||||
cardinality,
|
||||
num_docs,
|
||||
numerical_type,
|
||||
numerical_column_writer.operation_iterator(
|
||||
arena,
|
||||
old_to_new_row_ids,
|
||||
&mut symbol_byte_buffer,
|
||||
),
|
||||
numerical_column_writer.operation_iterator(arena, &mut symbol_byte_buffer),
|
||||
buffers,
|
||||
&mut column_serializer,
|
||||
)?;
|
||||
@@ -465,11 +376,7 @@ impl ColumnarWriter {
|
||||
cardinality,
|
||||
num_docs,
|
||||
NumericalType::I64,
|
||||
column_writer.operation_iterator(
|
||||
arena,
|
||||
old_to_new_row_ids,
|
||||
&mut symbol_byte_buffer,
|
||||
),
|
||||
column_writer.operation_iterator(arena, &mut symbol_byte_buffer),
|
||||
buffers,
|
||||
&mut column_serializer,
|
||||
)?;
|
||||
@@ -482,56 +389,6 @@ impl ColumnarWriter {
|
||||
}
|
||||
}
|
||||
|
||||
/// Shared sorting pattern for both numeric and Str/Bytes sort fields.
|
||||
///
|
||||
/// Iterates column operations, fills gaps for missing docs with `default_key`, converts each value
|
||||
/// to a sort key via `value_to_key`, then sorts by the key using `cmp_keys`. Returns the doc ids
|
||||
/// in sorted order.
|
||||
fn collect_sort_order_from_ops<V, K: Clone>(
|
||||
ops: impl Iterator<Item = ColumnOperation<V>>,
|
||||
num_docs: RowId,
|
||||
reversed: bool,
|
||||
value_to_key: impl Fn(V) -> K,
|
||||
default_key: K,
|
||||
cmp_keys: impl Fn(&K, &K) -> std::cmp::Ordering,
|
||||
) -> Vec<u32> {
|
||||
let mut doc_sort_keys: Vec<(K, RowId)> = Vec::with_capacity(num_docs as usize);
|
||||
let mut start_doc_check_fill: RowId = 0;
|
||||
let mut current_doc_opt: Option<RowId> = None;
|
||||
|
||||
for op in ops {
|
||||
match op {
|
||||
ColumnOperation::NewDoc(doc) => {
|
||||
current_doc_opt = Some(doc);
|
||||
}
|
||||
ColumnOperation::Value(val) => {
|
||||
if let Some(current_doc) = current_doc_opt {
|
||||
// Fill gaps since the last doc with the default key.
|
||||
doc_sort_keys.extend(
|
||||
(start_doc_check_fill..current_doc).map(|doc| (default_key.clone(), doc)),
|
||||
);
|
||||
start_doc_check_fill = current_doc + 1;
|
||||
// For multivalued fields, only the first value is used.
|
||||
current_doc_opt = None;
|
||||
|
||||
doc_sort_keys.push((value_to_key(val), current_doc));
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
// Fill remaining docs at the tail.
|
||||
doc_sort_keys.extend((start_doc_check_fill..num_docs).map(|doc| (default_key.clone(), doc)));
|
||||
|
||||
doc_sort_keys.sort_by(|(left_key, _), (right_key, _)| {
|
||||
let cmp = cmp_keys(left_key, right_key);
|
||||
if reversed { cmp.reverse() } else { cmp }
|
||||
});
|
||||
doc_sort_keys
|
||||
.into_iter()
|
||||
.map(|(_sort_key, doc)| doc)
|
||||
.collect()
|
||||
}
|
||||
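The pattern in `collect_sort_order_from_ops` can be reduced to a small standalone sketch: walk the operation stream, fill gaps with a default key, keep only the first value per document, then apply a stable sort. The `Op` enum and the `sort_order` function below are simplified, hypothetical stand-ins for `ColumnOperation` and the generic helper above, with the key fixed to `Option<u64>` so that missing values sort before present ones.

```rust
/// Hypothetical stand-in for a column operation stream.
#[derive(Clone)]
enum Op {
    NewDoc(u32),
    Value(u64),
}

/// Returns doc ids `0..num_docs` ordered by their first value.
/// Docs without a value get the key `None`, which sorts before any `Some(_)`.
fn sort_order(ops: impl Iterator<Item = Op>, num_docs: u32, reversed: bool) -> Vec<u32> {
    let mut keys: Vec<(Option<u64>, u32)> = Vec::with_capacity(num_docs as usize);
    let mut next_unfilled: u32 = 0;
    let mut current_doc: Option<u32> = None;
    for op in ops {
        match op {
            Op::NewDoc(doc) => current_doc = Some(doc),
            Op::Value(val) => {
                // `take()` ensures only the first value of a multivalued row is used.
                if let Some(doc) = current_doc.take() {
                    // Fill the docs skipped since the last recorded doc.
                    keys.extend((next_unfilled..doc).map(|d| (None, d)));
                    next_unfilled = doc + 1;
                    keys.push((Some(val), doc));
                }
            }
        }
    }
    // Fill the remaining docs at the tail.
    keys.extend((next_unfilled..num_docs).map(|d| (None, d)));
    // `sort_by` is stable, so docs with equal keys keep ascending doc order.
    keys.sort_by(|(a, _), (b, _)| if reversed { b.cmp(a) } else { a.cmp(b) });
    keys.into_iter().map(|(_, doc)| doc).collect()
}
```

Because the sort is stable and the keys are pushed in ascending doc order, documents with equal keys stay in doc-id order in both the ascending and the reversed case.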
|
||||
// Serialize [Dictionary, Column, dictionary num bytes U32::LE]
|
||||
// Column: [Column Index, Column Values, column index num bytes U32::LE]
|
||||
#[expect(clippy::too_many_arguments)]
|
||||
@@ -832,7 +689,7 @@ mod tests {
|
||||
assert_eq!(column_writer.get_cardinality(3), Cardinality::Full);
|
||||
let mut buffer = Vec::new();
|
||||
let symbols: Vec<ColumnOperation<NumericalValue>> = column_writer
|
||||
.operation_iterator(&arena, None, &mut buffer)
|
||||
.operation_iterator(&arena, &mut buffer)
|
||||
.collect();
|
||||
assert_eq!(symbols.len(), 6);
|
||||
assert!(matches!(symbols[0], ColumnOperation::NewDoc(0u32)));
|
||||
@@ -861,7 +718,7 @@ mod tests {
|
||||
assert_eq!(column_writer.get_cardinality(3), Cardinality::Optional);
|
||||
let mut buffer = Vec::new();
|
||||
let symbols: Vec<ColumnOperation<NumericalValue>> = column_writer
|
||||
.operation_iterator(&arena, None, &mut buffer)
|
||||
.operation_iterator(&arena, &mut buffer)
|
||||
.collect();
|
||||
assert_eq!(symbols.len(), 4);
|
||||
assert!(matches!(symbols[0], ColumnOperation::NewDoc(1u32)));
|
||||
@@ -884,7 +741,7 @@ mod tests {
|
||||
assert_eq!(column_writer.get_cardinality(2), Cardinality::Optional);
|
||||
let mut buffer = Vec::new();
|
||||
let symbols: Vec<ColumnOperation<NumericalValue>> = column_writer
|
||||
.operation_iterator(&arena, None, &mut buffer)
|
||||
.operation_iterator(&arena, &mut buffer)
|
||||
.collect();
|
||||
assert_eq!(symbols.len(), 2);
|
||||
assert!(matches!(symbols[0], ColumnOperation::NewDoc(0u32)));
|
||||
@@ -903,7 +760,7 @@ mod tests {
|
||||
assert_eq!(column_writer.get_cardinality(1), Cardinality::Multivalued);
|
||||
let mut buffer = Vec::new();
|
||||
let symbols: Vec<ColumnOperation<NumericalValue>> = column_writer
|
||||
.operation_iterator(&arena, None, &mut buffer)
|
||||
.operation_iterator(&arena, &mut buffer)
|
||||
.collect();
|
||||
assert_eq!(symbols.len(), 3);
|
||||
assert!(matches!(symbols[0], ColumnOperation::NewDoc(0u32)));
|
||||
|
||||
@@ -27,7 +27,7 @@ fn generate_columnar(num_docs: u32, value_offset: u64) -> Vec<u8> {
|
||||
}
|
||||
|
||||
let mut wrt: Vec<u8> = Vec::new();
|
||||
columnar_writer.serialize(num_docs, None, &mut wrt).unwrap();
|
||||
columnar_writer.serialize(num_docs, &mut wrt).unwrap();
|
||||
|
||||
wrt
|
||||
}
|
||||
|
||||
@@ -51,16 +51,6 @@ impl DictionaryBuilder {
|
||||
UnorderedId(unordered_id)
|
||||
}
|
||||
|
||||
fn build_sorted_terms<'a>(&'a self, arena: &'a MemoryArena) -> Vec<(&'a [u8], UnorderedId)> {
|
||||
let mut terms: Vec<(&[u8], UnorderedId)> = self
|
||||
.dict
|
||||
.iter(arena)
|
||||
.map(|(k, v)| (k, arena.read(v)))
|
||||
.collect();
|
||||
terms.sort_unstable_by_key(|(key, _)| *key);
|
||||
terms
|
||||
}
|
||||
|
||||
/// Serialize the dictionary into an fst, and returns the
|
||||
/// `UnorderedId -> TermOrdinal` map.
|
||||
pub fn serialize<'a, W: io::Write + 'a>(
|
||||
@@ -68,7 +58,12 @@ impl DictionaryBuilder {
|
||||
arena: &MemoryArena,
|
||||
wrt: &mut W,
|
||||
) -> io::Result<TermIdMapping> {
|
||||
let terms = self.build_sorted_terms(arena);
|
||||
let mut terms: Vec<(&[u8], UnorderedId)> = self
|
||||
.dict
|
||||
.iter(arena)
|
||||
.map(|(k, v)| (k, arena.read(v)))
|
||||
.collect();
|
||||
terms.sort_unstable_by_key(|(key, _)| *key);
|
||||
// TODO Remove the allocation.
|
||||
let mut unordered_to_ord: Vec<OrderedId> = vec![OrderedId(0u32); terms.len()];
|
||||
let mut sstable_builder = sstable::VoidSSTable::writer(wrt);
|
||||
@@ -81,16 +76,6 @@ impl DictionaryBuilder {
|
||||
Ok(TermIdMapping { unordered_to_ord })
|
||||
}
|
||||
|
||||
/// Build the `UnorderedId -> OrderedId` mapping in memory without serializing.
|
||||
pub fn build_term_id_mapping(&self, arena: &MemoryArena) -> TermIdMapping {
|
||||
let terms = self.build_sorted_terms(arena);
|
||||
let mut unordered_to_ord: Vec<OrderedId> = vec![OrderedId(0u32); terms.len()];
|
||||
for (ord, (_key, unordered_id)) in terms.into_iter().enumerate() {
|
||||
unordered_to_ord[unordered_id.0 as usize] = OrderedId(ord as u32);
|
||||
}
|
||||
TermIdMapping { unordered_to_ord }
|
||||
}
|
||||
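`build_term_id_mapping` turns insertion-order ids into sorted ordinals: sort the `(term, unordered_id)` pairs by term, then write each pair's rank into the slot indexed by its unordered id. A minimal sketch of that step on plain slices follows. The function name `build_unordered_to_ord` is hypothetical, and the input is assumed to hold distinct terms indexed by their unordered id.

```rust
/// Given terms indexed by insertion (unordered) id, returns `unordered_to_ord`,
/// where `unordered_to_ord[id]` is the rank of that term in sorted order.
fn build_unordered_to_ord(terms_by_unordered_id: &[&[u8]]) -> Vec<u32> {
    let mut sorted: Vec<(&[u8], usize)> = terms_by_unordered_id
        .iter()
        .enumerate()
        .map(|(id, term)| (*term, id))
        .collect();
    // Terms are assumed distinct, so an unstable sort is sufficient.
    sorted.sort_unstable_by_key(|(term, _)| *term);
    let mut unordered_to_ord = vec![0u32; sorted.len()];
    for (ord, (_term, id)) in sorted.into_iter().enumerate() {
        unordered_to_ord[id] = ord as u32;
    }
    unordered_to_ord
}
```

For the insertion order `pear`, `apple`, `mango`, the sorted order is `apple`, `mango`, `pear`, so the resulting mapping is `[2, 0, 1]`.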
|
||||
pub(crate) fn mem_usage(&self) -> usize {
|
||||
self.dict.mem_usage()
|
||||
}
|
||||
|
||||
@@ -3,8 +3,7 @@ use std::sync::Arc;
|
||||
use std::{fmt, io};
|
||||
|
||||
use common::file_slice::FileSlice;
|
||||
use common::{ByteCount, DateTime, OwnedBytes};
|
||||
use serde::{Deserialize, Serialize};
|
||||
use common::{ByteCount, DateTime, HasLen, OwnedBytes};
|
||||
|
||||
use crate::column::{BytesColumn, Column, StrColumn};
|
||||
use crate::column_values::{StrictlyMonotonicFn, monotonic_map_column};
|
||||
@@ -318,89 +317,10 @@ impl DynamicColumnHandle {
|
||||
}
|
||||
|
||||
pub fn num_bytes(&self) -> ByteCount {
|
||||
self.file_slice.num_bytes()
|
||||
}
|
||||
|
||||
/// Legacy helper returning the column space usage.
|
||||
pub fn column_and_dictionary_num_bytes(&self) -> io::Result<ColumnSpaceUsage> {
|
||||
self.space_usage()
|
||||
}
|
||||
|
||||
/// Return the space usage of the column, optionally broken down by dictionary and column
|
||||
/// values.
|
||||
///
|
||||
/// For dictionary encoded columns (strings and bytes), this splits the total footprint into
|
||||
/// the dictionary and the remaining column data (including index and values).
|
||||
/// For all other column types, the dictionary size is `None` and the column size
|
||||
/// equals the total bytes.
|
||||
pub fn space_usage(&self) -> io::Result<ColumnSpaceUsage> {
|
||||
let total_num_bytes = self.num_bytes();
|
||||
let dynamic_column = self.open()?;
|
||||
let dictionary_num_bytes = match &dynamic_column {
|
||||
DynamicColumn::Bytes(bytes_column) => bytes_column.dictionary().num_bytes(),
|
||||
DynamicColumn::Str(str_column) => str_column.dictionary().num_bytes(),
|
||||
_ => {
|
||||
return Ok(ColumnSpaceUsage::new(self.num_bytes(), None));
|
||||
}
|
||||
};
|
||||
assert!(dictionary_num_bytes <= total_num_bytes);
|
||||
let column_num_bytes =
|
||||
ByteCount::from(total_num_bytes.get_bytes() - dictionary_num_bytes.get_bytes());
|
||||
Ok(ColumnSpaceUsage::new(
|
||||
column_num_bytes,
|
||||
Some(dictionary_num_bytes),
|
||||
))
|
||||
self.file_slice.len().into()
|
||||
}
|
||||
|
||||
pub fn column_type(&self) -> ColumnType {
|
||||
self.column_type
|
||||
}
|
||||
}
|
||||
|
||||
/// Represents space usage of a column.
|
||||
///
|
||||
/// `column_num_bytes` tracks the column payload (index, values and footer).
|
||||
/// For dictionary encoded columns, `dictionary_num_bytes` captures the dictionary footprint.
|
||||
/// [`ColumnSpaceUsage::total_num_bytes`] returns the sum of both parts.
|
||||
#[derive(Clone, Debug, Serialize, Deserialize)]
|
||||
pub struct ColumnSpaceUsage {
|
||||
column_num_bytes: ByteCount,
|
||||
dictionary_num_bytes: Option<ByteCount>,
|
||||
}
|
||||
|
||||
impl ColumnSpaceUsage {
|
||||
pub(crate) fn new(
|
||||
column_num_bytes: ByteCount,
|
||||
dictionary_num_bytes: Option<ByteCount>,
|
||||
) -> Self {
|
||||
ColumnSpaceUsage {
|
||||
column_num_bytes,
|
||||
dictionary_num_bytes,
|
||||
}
|
||||
}
|
||||
|
||||
pub fn column_num_bytes(&self) -> ByteCount {
|
||||
self.column_num_bytes
|
||||
}
|
||||
|
||||
pub fn dictionary_num_bytes(&self) -> Option<ByteCount> {
|
||||
self.dictionary_num_bytes
|
||||
}
|
||||
|
||||
pub fn total_num_bytes(&self) -> ByteCount {
|
||||
self.column_num_bytes + self.dictionary_num_bytes.unwrap_or_default()
|
||||
}
|
||||
|
||||
/// Merge two space usage values by summing their components.
|
||||
pub fn merge(&self, other: &ColumnSpaceUsage) -> ColumnSpaceUsage {
|
||||
let dictionary_num_bytes = match (self.dictionary_num_bytes, other.dictionary_num_bytes) {
|
||||
(Some(lhs), Some(rhs)) => Some(lhs + rhs),
|
||||
(Some(val), None) | (None, Some(val)) => Some(val),
|
||||
(None, None) => None,
|
||||
};
|
||||
ColumnSpaceUsage {
|
||||
column_num_bytes: self.column_num_bytes + other.column_num_bytes,
|
||||
dictionary_num_bytes,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@@ -43,13 +43,12 @@ pub use column_values::{
|
||||
};
|
||||
pub use columnar::{
|
||||
CURRENT_VERSION, ColumnType, ColumnarReader, ColumnarWriter, HasAssociatedColumnType,
|
||||
MergeRowOrder, ShuffleMergeOrder, StackMergeOrder, Version, compute_merged_term_ord_mapping,
|
||||
merge_columnar,
|
||||
MergeRowOrder, ShuffleMergeOrder, StackMergeOrder, Version, merge_columnar,
|
||||
};
|
||||
use sstable::VoidSSTable;
|
||||
pub use value::{NumericalType, NumericalValue};
|
||||
|
||||
pub use self::dynamic_column::{ColumnSpaceUsage, DynamicColumn, DynamicColumnHandle};
|
||||
pub use self::dynamic_column::{DynamicColumn, DynamicColumnHandle};
|
||||
|
||||
pub type RowId = u32;
|
||||
pub type DocId = u32;
|
||||
@@ -60,7 +59,7 @@ pub struct RowAddr {
|
||||
pub row_id: RowId,
|
||||
}
|
||||
|
||||
pub use sstable::{Dictionary, TermOrdHit};
|
||||
pub use sstable::Dictionary;
|
||||
pub type Streamer<'a> = sstable::Streamer<'a, VoidSSTable>;
|
||||
|
||||
pub use common::DateTime;
|
||||
|
||||
@@ -21,7 +21,7 @@ fn test_dataframe_writer_str() {
|
||||
dataframe_writer.record_str(1u32, "my_string", "hello");
|
||||
dataframe_writer.record_str(3u32, "my_string", "helloeee");
|
||||
let mut buffer: Vec<u8> = Vec::new();
|
||||
dataframe_writer.serialize(5, None, &mut buffer).unwrap();
|
||||
dataframe_writer.serialize(5, &mut buffer).unwrap();
|
||||
let columnar = ColumnarReader::open(buffer).unwrap();
|
||||
assert_eq!(columnar.num_columns(), 1);
|
||||
let cols: Vec<DynamicColumnHandle> = columnar.read_columns("my_string").unwrap();
|
||||
@@ -35,7 +35,7 @@ fn test_dataframe_writer_bytes() {
|
||||
dataframe_writer.record_bytes(1u32, "my_string", b"hello");
|
||||
dataframe_writer.record_bytes(3u32, "my_string", b"helloeee");
|
||||
let mut buffer: Vec<u8> = Vec::new();
|
||||
dataframe_writer.serialize(5, None, &mut buffer).unwrap();
|
||||
dataframe_writer.serialize(5, &mut buffer).unwrap();
|
||||
let columnar = ColumnarReader::open(buffer).unwrap();
|
||||
assert_eq!(columnar.num_columns(), 1);
|
||||
let cols: Vec<DynamicColumnHandle> = columnar.read_columns("my_string").unwrap();
|
||||
@@ -49,7 +49,7 @@ fn test_dataframe_writer_bool() {
|
||||
dataframe_writer.record_bool(1u32, "bool.value", false);
|
||||
dataframe_writer.record_bool(3u32, "bool.value", true);
|
||||
let mut buffer: Vec<u8> = Vec::new();
|
||||
dataframe_writer.serialize(5, None, &mut buffer).unwrap();
|
||||
dataframe_writer.serialize(5, &mut buffer).unwrap();
|
||||
let columnar = ColumnarReader::open(buffer).unwrap();
|
||||
assert_eq!(columnar.num_columns(), 1);
|
||||
let cols: Vec<DynamicColumnHandle> = columnar.read_columns("bool.value").unwrap();
|
||||
@@ -60,7 +60,7 @@ fn test_dataframe_writer_bool() {
|
||||
let DynamicColumn::Bool(bool_col) = dyn_bool_col else {
|
||||
panic!();
|
||||
};
|
||||
let vals: Vec<Option<bool>> = (0..5).map(|doc_id| bool_col.first(doc_id)).collect();
|
||||
let vals: Vec<Option<bool>> = (0..5).map(|row_id| bool_col.first(row_id)).collect();
|
||||
assert_eq!(&vals, &[None, Some(false), None, Some(true), None,]);
|
||||
}
|
||||
|
||||
@@ -74,7 +74,7 @@ fn test_dataframe_writer_u64_multivalued() {
|
||||
dataframe_writer.record_numerical(6u32, "divisor", 2u64);
|
||||
dataframe_writer.record_numerical(6u32, "divisor", 3u64);
|
||||
let mut buffer: Vec<u8> = Vec::new();
|
||||
dataframe_writer.serialize(7, None, &mut buffer).unwrap();
|
||||
dataframe_writer.serialize(7, &mut buffer).unwrap();
|
||||
let columnar = ColumnarReader::open(buffer).unwrap();
|
||||
assert_eq!(columnar.num_columns(), 1);
|
||||
let cols: Vec<DynamicColumnHandle> = columnar.read_columns("divisor").unwrap();
|
||||
@@ -97,7 +97,7 @@ fn test_dataframe_writer_ip_addr() {
|
||||
dataframe_writer.record_ip_addr(1, "ip_addr", Ipv6Addr::from_u128(1001));
|
||||
dataframe_writer.record_ip_addr(3, "ip_addr", Ipv6Addr::from_u128(1050));
|
||||
let mut buffer: Vec<u8> = Vec::new();
|
||||
dataframe_writer.serialize(5, None, &mut buffer).unwrap();
|
||||
dataframe_writer.serialize(5, &mut buffer).unwrap();
|
||||
let columnar = ColumnarReader::open(buffer).unwrap();
|
||||
assert_eq!(columnar.num_columns(), 1);
|
||||
let cols: Vec<DynamicColumnHandle> = columnar.read_columns("ip_addr").unwrap();
|
||||
@@ -108,7 +108,7 @@ fn test_dataframe_writer_ip_addr() {
|
||||
let DynamicColumn::IpAddr(ip_col) = dyn_bool_col else {
|
||||
panic!();
|
||||
};
|
||||
let vals: Vec<Option<Ipv6Addr>> = (0..5).map(|doc_id| ip_col.first(doc_id)).collect();
|
||||
let vals: Vec<Option<Ipv6Addr>> = (0..5).map(|row_id| ip_col.first(row_id)).collect();
|
||||
assert_eq!(
|
||||
&vals,
|
||||
&[
|
||||
@@ -128,7 +128,7 @@ fn test_dataframe_writer_numerical() {
|
||||
dataframe_writer.record_numerical(2u32, "srical.value", NumericalValue::U64(13u64));
|
||||
dataframe_writer.record_numerical(4u32, "srical.value", NumericalValue::U64(15u64));
|
||||
let mut buffer: Vec<u8> = Vec::new();
|
||||
dataframe_writer.serialize(6, None, &mut buffer).unwrap();
|
||||
dataframe_writer.serialize(6, &mut buffer).unwrap();
|
||||
let columnar = ColumnarReader::open(buffer).unwrap();
|
||||
assert_eq!(columnar.num_columns(), 1);
|
||||
let cols: Vec<DynamicColumnHandle> = columnar.read_columns("srical.value").unwrap();
|
||||
@@ -153,46 +153,6 @@ fn test_dataframe_writer_numerical() {
|
||||
assert_eq!(column_i64.first(6), None); //< we can change the spec for that one.
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_dataframe_sort_by_full() {
|
||||
let mut dataframe_writer = ColumnarWriter::default();
|
||||
dataframe_writer.record_numerical(0u32, "value", NumericalValue::U64(1));
|
||||
dataframe_writer.record_numerical(1u32, "value", NumericalValue::U64(2));
|
||||
let data = dataframe_writer.sort_order("value", 2, false);
|
||||
assert_eq!(data, vec![0, 1]);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_dataframe_sort_by_opt() {
|
||||
let mut dataframe_writer = ColumnarWriter::default();
|
||||
dataframe_writer.record_numerical(1u32, "value", NumericalValue::U64(3));
|
||||
dataframe_writer.record_numerical(3u32, "value", NumericalValue::U64(2));
|
||||
let data = dataframe_writer.sort_order("value", 5, false);
|
||||
// 0, 2, 4 is 0.0
|
||||
assert_eq!(data, vec![0, 2, 4, 3, 1]);
|
||||
let data = dataframe_writer.sort_order("value", 5, true);
|
||||
assert_eq!(
|
||||
data,
|
||||
vec![4, 2, 0, 3, 1].into_iter().rev().collect::<Vec<_>>()
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_dataframe_sort_by_multi() {
|
||||
let mut dataframe_writer = ColumnarWriter::default();
|
||||
// valid for sort
|
||||
dataframe_writer.record_numerical(1u32, "value", NumericalValue::U64(2));
|
||||
// those are ignored for sort
|
||||
dataframe_writer.record_numerical(1u32, "value", NumericalValue::U64(4));
|
||||
dataframe_writer.record_numerical(1u32, "value", NumericalValue::U64(4));
|
||||
// valid for sort
|
||||
dataframe_writer.record_numerical(3u32, "value", NumericalValue::U64(3));
|
||||
// ignored, would change sort order
|
||||
dataframe_writer.record_numerical(3u32, "value", NumericalValue::U64(1));
|
||||
let data = dataframe_writer.sort_order("value", 4, false);
|
||||
assert_eq!(data, vec![0, 2, 1, 3]);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_dictionary_encoded_str() {
|
||||
let mut buffer = Vec::new();
|
||||
@@ -201,7 +161,7 @@ fn test_dictionary_encoded_str() {
|
||||
columnar_writer.record_str(3, "my.column", "c");
|
||||
columnar_writer.record_str(3, "my.column2", "different_column!");
|
||||
columnar_writer.record_str(4, "my.column", "b");
|
||||
columnar_writer.serialize(5, None, &mut buffer).unwrap();
|
||||
columnar_writer.serialize(5, &mut buffer).unwrap();
|
||||
let columnar_reader = ColumnarReader::open(buffer).unwrap();
|
||||
assert_eq!(columnar_reader.num_columns(), 2);
|
||||
let col_handles = columnar_reader.read_columns("my.column").unwrap();
|
||||
@@ -209,7 +169,7 @@ fn test_dictionary_encoded_str() {
|
||||
let DynamicColumn::Str(str_col) = col_handles[0].open().unwrap() else {
|
||||
panic!();
|
||||
};
|
||||
let index: Vec<Option<u64>> = (0..5).map(|doc_id| str_col.ords().first(doc_id)).collect();
|
||||
let index: Vec<Option<u64>> = (0..5).map(|row_id| str_col.ords().first(row_id)).collect();
|
||||
assert_eq!(index, &[None, Some(0), None, Some(2), Some(1)]);
|
||||
assert_eq!(str_col.num_rows(), 5);
|
||||
let mut term_buffer = String::new();
|
||||
@@ -235,7 +195,7 @@ fn test_dictionary_encoded_bytes() {
|
||||
columnar_writer.record_bytes(3, "my.column", b"c");
|
||||
columnar_writer.record_bytes(3, "my.column2", b"different_column!");
|
||||
columnar_writer.record_bytes(4, "my.column", b"b");
|
||||
columnar_writer.serialize(5, None, &mut buffer).unwrap();
|
||||
columnar_writer.serialize(5, &mut buffer).unwrap();
|
||||
let columnar_reader = ColumnarReader::open(buffer).unwrap();
|
||||
assert_eq!(columnar_reader.num_columns(), 2);
|
||||
let col_handles = columnar_reader.read_columns("my.column").unwrap();
|
||||
@@ -244,7 +204,7 @@ fn test_dictionary_encoded_bytes() {
|
||||
panic!();
|
||||
};
|
||||
let index: Vec<Option<u64>> = (0..5)
|
||||
.map(|doc_id| bytes_col.ords().first(doc_id))
|
||||
.map(|row_id| bytes_col.ords().first(row_id))
|
||||
.collect();
|
||||
assert_eq!(index, &[None, Some(0), None, Some(2), Some(1)]);
|
||||
assert_eq!(bytes_col.num_rows(), 5);
|
||||
@@ -272,93 +232,6 @@ fn test_dictionary_encoded_bytes() {
|
||||
assert_eq!(term_buffer, b"b");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_sort_order_str_asc_desc() {
|
||||
let mut dataframe_writer = ColumnarWriter::default();
|
||||
dataframe_writer.record_str(0, "s", "z");
|
||||
dataframe_writer.record_str(2, "s", "a");
|
||||
dataframe_writer.record_str(3, "s", "m");
|
||||
|
||||
let asc = dataframe_writer.sort_order("s", 4, false);
|
||||
assert_eq!(asc, vec![1, 2, 3, 0]); // None, a, m, z
|
||||
|
||||
let desc = dataframe_writer.sort_order("s", 4, true);
|
||||
assert_eq!(desc, vec![0, 3, 2, 1]); // z, m, a, None
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_sort_order_str_empty_vs_missing() {
|
||||
let mut dataframe_writer = ColumnarWriter::default();
|
||||
dataframe_writer.record_str(0, "s", "");
|
||||
|
||||
let asc = dataframe_writer.sort_order("s", 2, false);
|
||||
assert_eq!(asc, vec![1, 0]); // None first, then empty string
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_sort_order_str_multivalued_stable() {
|
||||
let mut dataframe_writer = ColumnarWriter::default();
|
||||
dataframe_writer.record_str(0, "s", "z");
|
||||
dataframe_writer.record_str(0, "s", "a"); // multivalued; first value wins
|
||||
dataframe_writer.record_str(1, "s", "b");
|
||||
dataframe_writer.record_str(2, "s", "b");
|
||||
|
||||
let asc = dataframe_writer.sort_order("s", 3, false);
|
||||
assert_eq!(asc, vec![1, 2, 0]); // b, b (stable), z
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_sort_order_bytes_asc() {
|
||||
let mut dataframe_writer = ColumnarWriter::default();
|
||||
dataframe_writer.record_bytes(1, "b", &[0x01]);
|
||||
dataframe_writer.record_bytes(3, "b", &[0x00]);
|
||||
|
||||
let asc = dataframe_writer.sort_order("b", 4, false);
|
||||
assert_eq!(asc, vec![0, 2, 3, 1]); // None, None, 0x00, 0x01
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_sort_order_numeric_u64_above_2_24() {
|
||||
let mut dataframe_writer = ColumnarWriter::default();
|
||||
dataframe_writer.record_numerical(0, "n", 16_777_217u64);
|
||||
dataframe_writer.record_numerical(1, "n", 16_777_216u64);
|
||||
|
||||
let asc = dataframe_writer.sort_order("n", 2, false);
|
||||
assert_eq!(asc, vec![1, 0]); // 16,777,216 then 16,777,217
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_sort_order_numeric_u64_above_2_53() {
|
||||
let mut dataframe_writer = ColumnarWriter::default();
|
||||
dataframe_writer.record_numerical(0, "n", 9_007_199_254_740_993u64);
|
||||
dataframe_writer.record_numerical(1, "n", 9_007_199_254_740_992u64);
|
||||
|
||||
let asc = dataframe_writer.sort_order("n", 2, false);
|
||||
assert_eq!(asc, vec![1, 0]); // smaller value first
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_sort_order_numeric_null_vs_zero() {
|
||||
let mut dataframe_writer = ColumnarWriter::default();
|
||||
dataframe_writer.record_numerical(0, "n", 0u64);
|
||||
|
||||
let asc = dataframe_writer.sort_order("n", 2, false);
|
||||
assert_eq!(asc, vec![1, 0]); // None first, then 0
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_sort_order_datetime_close_timestamps() {
|
||||
let mut dataframe_writer = ColumnarWriter::default();
|
||||
// Two timestamps 1 nanosecond apart. As f32, both round to the same value.
|
||||
let dt1 = DateTime::from_timestamp_nanos(1_700_000_000_000_000_001);
|
||||
let dt2 = DateTime::from_timestamp_nanos(1_700_000_000_000_000_000);
|
||||
dataframe_writer.record_datetime(0, "ts", dt1);
|
||||
dataframe_writer.record_datetime(1, "ts", dt2);
|
||||
|
||||
let asc = dataframe_writer.sort_order("ts", 2, false);
|
||||
assert_eq!(asc, vec![1, 0]); // smaller timestamp first
|
||||
}
|
||||
|
||||
fn num_strategy() -> impl Strategy<Value = NumericalValue> {
|
||||
prop_oneof![
|
||||
3 => Just(NumericalValue::U64(0u64)),
|
||||
@@ -456,26 +329,12 @@ fn columnar_docs_strategy() -> impl Strategy<Value = Vec<Vec<(&'static str, Colu
|
||||
.prop_flat_map(|num_docs| proptest::collection::vec(doc_strategy(), num_docs))
|
||||
}
|
||||
|
||||
fn columnar_docs_and_mapping_strategy()
|
||||
-> impl Strategy<Value = (Vec<Vec<(&'static str, ColumnValue)>>, Vec<RowId>)> {
|
||||
columnar_docs_strategy().prop_flat_map(|docs| {
|
||||
permutation_strategy(docs.len()).prop_map(move |permutation| (docs.clone(), permutation))
|
||||
})
|
||||
}
|
||||
|
||||
fn permutation_strategy(n: usize) -> impl Strategy<Value = Vec<RowId>> {
|
||||
Just((0u32..n as RowId).collect()).prop_shuffle()
|
||||
}
|
||||
|
||||
fn permutation_and_subset_strategy(n: usize) -> impl Strategy<Value = Vec<usize>> {
|
||||
let vals: Vec<usize> = (0..n).collect();
|
||||
subsequence(vals, 0..=n).prop_shuffle()
|
||||
}
|
||||
|
||||
fn build_columnar_with_mapping(
|
||||
docs: &[Vec<(&'static str, ColumnValue)>],
|
||||
old_to_new_row_ids_opt: Option<&[RowId]>,
|
||||
) -> ColumnarReader {
|
||||
fn build_columnar_with_mapping(docs: &[Vec<(&'static str, ColumnValue)>]) -> ColumnarReader {
|
||||
let num_docs = docs.len() as u32;
|
||||
let mut buffer = Vec::new();
|
||||
let mut columnar_writer = ColumnarWriter::default();
|
||||
@@ -503,15 +362,13 @@ fn build_columnar_with_mapping(
|
||||
}
|
||||
}
|
||||
}
|
||||
columnar_writer
|
||||
.serialize(num_docs, old_to_new_row_ids_opt, &mut buffer)
|
||||
.unwrap();
|
||||
columnar_writer.serialize(num_docs, &mut buffer).unwrap();
|
||||
|
||||
ColumnarReader::open(buffer).unwrap()
|
||||
}
|
||||
|
||||
fn build_columnar(docs: &[Vec<(&'static str, ColumnValue)>]) -> ColumnarReader {
|
||||
build_columnar_with_mapping(docs, None)
|
||||
build_columnar_with_mapping(docs)
|
||||
}
|
||||
|
||||
fn assert_columnar_eq_strict(left: &ColumnarReader, right: &ColumnarReader) {
|
||||
@@ -771,54 +628,6 @@ proptest! {
|
||||
}
|
||||
}
|
||||
|
||||
// Same as `test_single_columnar_builder_proptest` but with a shuffling mapping.
|
||||
proptest! {
|
||||
#![proptest_config(ProptestConfig::with_cases(500))]
|
||||
#[test]
|
||||
fn test_single_columnar_builder_with_shuffle_proptest((docs, mapping) in columnar_docs_and_mapping_strategy()) {
|
||||
let columnar = build_columnar_with_mapping(&docs[..], Some(&mapping));
|
||||
assert_eq!(columnar.num_docs() as usize, docs.len());
|
||||
let mut expected_columns: HashMap<(&str, ColumnTypeCategory), HashMap<u32, Vec<&ColumnValue>> > = Default::default();
|
||||
for (doc_id, doc_vals) in docs.iter().enumerate() {
|
||||
for (col_name, col_val) in doc_vals {
|
||||
expected_columns
|
||||
.entry((col_name, col_val.column_type_category()))
|
||||
.or_default()
|
||||
.entry(mapping[doc_id])
|
||||
.or_default()
|
||||
.push(col_val);
|
||||
}
|
||||
}
|
||||
let column_list = columnar.list_columns().unwrap();
|
||||
assert_eq!(expected_columns.len(), column_list.len());
|
||||
for (column_name, column) in column_list {
|
||||
let dynamic_column = column.open().unwrap();
|
||||
let col_category: ColumnTypeCategory = dynamic_column.column_type().into();
|
||||
let expected_col_values: &HashMap<u32, Vec<&ColumnValue>> = expected_columns.get(&(column_name.as_str(), col_category)).unwrap();
|
||||
for _doc_id in 0..columnar.num_docs() {
|
||||
match &dynamic_column {
|
||||
DynamicColumn::Bool(col) =>
|
||||
assert_column_values(col, expected_col_values),
|
||||
DynamicColumn::I64(col) =>
|
||||
assert_column_values(col, expected_col_values),
|
||||
DynamicColumn::U64(col) =>
|
||||
assert_column_values(col, expected_col_values),
|
||||
DynamicColumn::F64(col) =>
|
||||
assert_column_values(col, expected_col_values),
|
||||
DynamicColumn::IpAddr(col) =>
|
||||
assert_column_values(col, expected_col_values),
|
||||
DynamicColumn::DateTime(col) =>
|
||||
assert_column_values(col, expected_col_values),
|
||||
DynamicColumn::Bytes(col) =>
|
||||
assert_bytes_column_values(col, expected_col_values, false),
|
||||
DynamicColumn::Str(col) =>
|
||||
assert_bytes_column_values(col, expected_col_values, true),
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// This tests create 2 or 3 random small columnar and attempts to merge them.
|
||||
// It compares the resulting merged dataframe with what would have been obtained by building the
|
||||
// dataframe from the concatenated rows to begin with.
|
||||
|
||||
@@ -1,5 +1,3 @@
|
||||
use std::str::FromStr;
|
||||
|
||||
use common::DateTime;
|
||||
|
||||
use crate::InvalidData;
|
||||
@@ -11,23 +9,6 @@ pub enum NumericalValue {
|
||||
F64(f64),
|
||||
}
|
||||
|
||||
impl FromStr for NumericalValue {
|
||||
type Err = ();
|
||||
|
||||
fn from_str(s: &str) -> Result<Self, ()> {
|
||||
if let Ok(val_i64) = s.parse::<i64>() {
|
||||
return Ok(val_i64.into());
|
||||
}
|
||||
if let Ok(val_u64) = s.parse::<u64>() {
|
||||
return Ok(val_u64.into());
|
||||
}
|
||||
if let Ok(val_f64) = s.parse::<f64>() {
|
||||
return Ok(NumericalValue::from(val_f64).normalize());
|
||||
}
|
||||
Err(())
|
||||
}
|
||||
}
|
||||
|
||||
impl NumericalValue {
|
||||
pub fn numerical_type(&self) -> NumericalType {
|
||||
match self {
|
||||
@@ -45,7 +26,7 @@ impl NumericalValue {
|
||||
if val <= i64::MAX as u64 {
|
||||
NumericalValue::I64(val as i64)
|
||||
} else {
|
||||
NumericalValue::U64(val)
|
||||
NumericalValue::F64(val as f64)
|
||||
}
|
||||
}
|
||||
NumericalValue::I64(val) => NumericalValue::I64(val),
|
||||
@@ -160,7 +141,6 @@ impl Coerce for DateTime {
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::NumericalType;
|
||||
use crate::NumericalValue;
|
||||
|
||||
#[test]
|
||||
fn test_numerical_type_code() {
|
||||
@@ -173,58 +153,4 @@ mod tests {
|
||||
}
|
||||
assert_eq!(num_numerical_type, 3);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_parse_numerical() {
|
||||
assert_eq!(
|
||||
"123".parse::<NumericalValue>().unwrap(),
|
||||
NumericalValue::I64(123)
|
||||
);
|
||||
assert_eq!(
|
||||
"18446744073709551615".parse::<NumericalValue>().unwrap(),
|
||||
NumericalValue::U64(18446744073709551615u64)
|
||||
);
|
||||
assert_eq!(
|
||||
"1.0".parse::<NumericalValue>().unwrap(),
|
||||
NumericalValue::I64(1i64)
|
||||
);
|
||||
assert_eq!(
|
||||
"1.1".parse::<NumericalValue>().unwrap(),
|
||||
NumericalValue::F64(1.1f64)
|
||||
);
|
||||
assert_eq!(
|
||||
"-1.0".parse::<NumericalValue>().unwrap(),
|
||||
NumericalValue::I64(-1i64)
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_normalize_numerical() {
|
||||
assert_eq!(
|
||||
NumericalValue::from(1u64).normalize(),
|
||||
NumericalValue::I64(1i64),
|
||||
);
|
||||
let limit_val = i64::MAX as u64 + 1u64;
|
||||
assert_eq!(
|
||||
NumericalValue::from(limit_val).normalize(),
|
||||
NumericalValue::U64(limit_val),
|
||||
);
|
||||
assert_eq!(
|
||||
NumericalValue::from(-1i64).normalize(),
|
||||
NumericalValue::I64(-1i64),
|
||||
);
|
||||
assert_eq!(
|
||||
NumericalValue::from(-2.0f64).normalize(),
|
||||
NumericalValue::I64(-2i64),
|
||||
);
|
||||
assert_eq!(
|
||||
NumericalValue::from(-2.1f64).normalize(),
|
||||
NumericalValue::F64(-2.1f64),
|
||||
);
|
||||
let large_float = 2.0f64.powf(70.0f64);
|
||||
assert_eq!(
|
||||
NumericalValue::from(large_float).normalize(),
|
||||
NumericalValue::F64(large_float),
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
[package]
|
||||
name = "tantivy-common"
|
||||
version = "0.11.0"
|
||||
version = "0.10.0"
|
||||
authors = ["Paul Masurel <paul@quickwit.io>", "Pascal Seitz <pascal@quickwit.io>"]
|
||||
license = "MIT"
|
||||
edition = "2024"
|
||||
@@ -15,10 +15,11 @@ repository = "https://github.com/quickwit-oss/tantivy"
|
||||
byteorder = "1.4.3"
|
||||
ownedbytes = { version= "0.9", path="../ownedbytes" }
|
||||
async-trait = "0.1"
|
||||
time = { version = "0.3.47", features = ["serde-well-known"] }
|
||||
time = { version = "0.3.10", features = ["serde-well-known"] }
|
||||
serde = { version = "1.0.136", features = ["derive"] }
|
||||
|
||||
[dev-dependencies]
|
||||
binggan = "0.17.0"
|
||||
binggan = "0.14.0"
|
||||
proptest = "1.0.0"
|
||||
rand = "0.9"
|
||||
rand = "0.8.4"
|
||||
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
use binggan::{BenchRunner, black_box};
|
||||
use rand::rng;
|
||||
use rand::seq::IteratorRandom;
|
||||
use rand::thread_rng;
|
||||
use tantivy_common::{BitSet, TinySet, serialize_vint_u32};
|
||||
|
||||
fn bench_vint() {
|
||||
@@ -17,7 +17,7 @@ fn bench_vint() {
|
||||
black_box(out);
|
||||
});
|
||||
|
||||
let vals: Vec<u32> = (0..20_000).choose_multiple(&mut rng(), 100_000);
|
||||
let vals: Vec<u32> = (0..20_000).choose_multiple(&mut thread_rng(), 100_000);
|
||||
runner.bench_function("bench_vint_rand", move |_| {
|
||||
let mut out = 0u64;
|
||||
for val in vals.iter().cloned() {
|
||||
|
||||
@@ -47,9 +47,6 @@ impl TinySet {
|
||||
TinySet(val)
|
||||
}
|
||||
|
||||
/// An empty `TinySet` constant.
|
||||
pub const EMPTY: TinySet = TinySet(0u64);
|
||||
|
||||
/// Returns an empty `TinySet`.
|
||||
#[inline]
|
||||
pub fn empty() -> TinySet {
|
||||
@@ -156,22 +153,7 @@ impl TinySet {
|
||||
None
|
||||
} else {
|
||||
let lowest = self.0.trailing_zeros();
|
||||
// Kernighan's trick: `n &= n - 1` clears the lowest set bit
|
||||
// without depending on `lowest`. This lets the CPU execute
|
||||
// `trailing_zeros` and the bit-clear in parallel instead of
|
||||
// serializing them.
|
||||
//
|
||||
// The previous form `self.0 ^= 1 << lowest` needs the result of
|
||||
// `trailing_zeros` before it can shift, creating a dependency chain:
|
||||
// ARM64: rbit → clz → lsl → eor
|
||||
// x86: tzcnt → btc
|
||||
//
|
||||
// With Kernighan's trick the clear path is independent of the count:
|
||||
// ARM64: sub → and (trailing_zeros runs in parallel)
|
||||
// x86: blsr (tzcnt runs in parallel)
|
||||
//
|
||||
// https://godbolt.org/z/fnfrP1T5f
|
||||
self.0 &= self.0 - 1;
|
||||
self.0 ^= TinySet::singleton(lowest).0;
|
||||
Some(lowest)
|
||||
}
|
||||
}
|
||||
@@ -199,17 +181,9 @@ pub struct BitSet {
|
||||
len: u64,
|
||||
max_value: u32,
|
||||
}
|
||||
impl std::fmt::Debug for BitSet {
|
||||
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
|
||||
f.debug_struct("BitSet")
|
||||
.field("len", &self.len)
|
||||
.field("max_value", &self.max_value)
|
||||
.finish()
|
||||
}
|
||||
}
|
||||
|
||||
fn num_buckets(max_val: u32) -> u32 {
|
||||
max_val.div_ceil(64u32)
|
||||
(max_val + 63u32) / 64u32
|
||||
}
|
||||
|
||||
impl BitSet {
|
||||
@@ -434,7 +408,7 @@ mod tests {
|
||||
use std::collections::HashSet;
|
||||
|
||||
use ownedbytes::OwnedBytes;
|
||||
use rand::distr::Bernoulli;
|
||||
use rand::distributions::Bernoulli;
|
||||
use rand::rngs::StdRng;
|
||||
use rand::{Rng, SeedableRng};
|
||||
|
||||
|
||||
@@ -121,7 +121,7 @@ pub struct FileSlice {
|
||||
|
||||
impl fmt::Debug for FileSlice {
|
||||
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
|
||||
write!(f, "FileSlice({:?}, {:?})", self.data, self.range)
|
||||
write!(f, "FileSlice({:?}, {:?})", &self.data, self.range)
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
@@ -28,9 +28,7 @@ impl BinarySerializable for VIntU128 {
|
||||
writer.write_all(&buffer)
|
||||
}
|
||||
|
||||
#[allow(clippy::unbuffered_bytes)]
|
||||
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Self> {
|
||||
#[allow(clippy::unbuffered_bytes)]
|
||||
let mut bytes = reader.bytes();
|
||||
let mut result = 0u128;
|
||||
let mut shift = 0u64;
|
||||
@@ -197,9 +195,7 @@ impl BinarySerializable for VInt {
|
||||
writer.write_all(&buffer[0..num_bytes])
|
||||
}
|
||||
|
||||
#[allow(clippy::unbuffered_bytes)]
|
||||
fn deserialize<R: Read>(reader: &mut R) -> io::Result<Self> {
|
||||
#[allow(clippy::unbuffered_bytes)]
|
||||
let mut bytes = reader.bytes();
|
||||
let mut result = 0u64;
|
||||
let mut shift = 0u64;
|
||||
|
||||
@@ -62,9 +62,7 @@ impl<W: TerminatingWrite> TerminatingWrite for CountingWriter<W> {
|
||||
pub struct AntiCallToken(());
|
||||
|
||||
/// Trait used to indicate when no more write need to be done on a writer
|
||||
///
|
||||
/// Thread-safety is enforced at the call sites that require it.
|
||||
pub trait TerminatingWrite: Write {
|
||||
pub trait TerminatingWrite: Write + Send + Sync {
|
||||
/// Indicate that the writer will no longer be used. Internally call terminate_ref.
|
||||
fn terminate(mut self) -> io::Result<()>
|
||||
where Self: Sized {
|
||||
|
||||
BIN
doc/assets/images/searchbenchmark.png
Normal file
Binary file not shown.
Size: 653 KiB
@@ -7,6 +7,11 @@
|
||||
- [Other](#other)
|
||||
- [Usage](#usage)
|
||||
|
||||
# Index Sorting has been removed!
|
||||
More infos here:
|
||||
|
||||
https://github.com/quickwit-oss/tantivy/issues/2352
|
||||
|
||||
# Index Sorting
|
||||
|
||||
Tantivy allows you to sort the index according to a property.
|
||||
|
||||
@@ -60,7 +60,7 @@ At indexing, tantivy will try to interpret number and strings as different type
|
||||
priority order.
|
||||
|
||||
Numbers will be interpreted as u64, i64 and f64 in that order.
|
||||
Strings will be interpreted as rfc3339 dates or simple strings.
|
||||
Strings will be interpreted as rfc3999 dates or simple strings.
|
||||
|
||||
The first working type is picked and is the only term that is emitted for indexing.
|
||||
Note this interpretation happens on a per-document basis, and there is no effort to try to sniff
|
||||
@@ -81,7 +81,7 @@ Will be interpreted as
|
||||
(my_path.my_segment, String, 233) or (my_path.my_segment, u64, 233)
|
||||
```
|
||||
|
||||
Likewise, we need to emit two tokens if the query contains an rfc3339 date.
|
||||
Likewise, we need to emit two tokens if the query contains an rfc3999 date.
|
||||
Indeed the date could have been actually a single token inside the text of a document at ingestion time. Generally speaking, we will always at least emit a string token in query parsing, and sometimes more.
|
||||
|
||||
If one more json field is defined, things get even more complicated.
|
||||
|
||||
@@ -208,7 +208,7 @@ fn main() -> tantivy::Result<()> {
|
||||
// is the role of the `TopDocs` collector.
|
||||
|
||||
// We can now perform our query.
|
||||
let top_docs = searcher.search(&query, &TopDocs::with_limit(10).order_by_score())?;
|
||||
let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
|
||||
|
||||
// The actual documents still need to be
|
||||
// retrieved from Tantivy's store.
|
||||
@@ -226,7 +226,7 @@ fn main() -> tantivy::Result<()> {
|
||||
let query = query_parser.parse_query("title:sea^20 body:whale^70")?;
|
||||
|
||||
let (_score, doc_address) = searcher
|
||||
.search(&query, &TopDocs::with_limit(1).order_by_score())?
|
||||
.search(&query, &TopDocs::with_limit(1))?
|
||||
.into_iter()
|
||||
.next()
|
||||
.unwrap();
|
||||
|
||||
@@ -100,7 +100,7 @@ fn main() -> tantivy::Result<()> {
|
||||
// here we want to get a hit on the 'ken' in Frankenstein
|
||||
let query = query_parser.parse_query("ken")?;
|
||||
|
||||
let top_docs = searcher.search(&query, &TopDocs::with_limit(10).order_by_score())?;
|
||||
let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
|
||||
|
||||
for (_, doc_address) in top_docs {
|
||||
let retrieved_doc: TantivyDocument = searcher.doc(doc_address)?;
|
||||
|
||||
@@ -50,14 +50,14 @@ fn main() -> tantivy::Result<()> {
|
||||
{
|
||||
// Simple exact search on the date
|
||||
let query = query_parser.parse_query("occurred_at:\"2022-06-22T12:53:50.53Z\"")?;
|
||||
let count_docs = searcher.search(&*query, &TopDocs::with_limit(5).order_by_score())?;
|
||||
let count_docs = searcher.search(&*query, &TopDocs::with_limit(5))?;
|
||||
assert_eq!(count_docs.len(), 1);
|
||||
}
|
||||
{
|
||||
// Range query on the date field
|
||||
let query = query_parser
|
||||
.parse_query(r#"occurred_at:[2022-06-22T12:58:00Z TO 2022-06-23T00:00:00Z}"#)?;
|
||||
let count_docs = searcher.search(&*query, &TopDocs::with_limit(4).order_by_score())?;
|
||||
let count_docs = searcher.search(&*query, &TopDocs::with_limit(4))?;
|
||||
assert_eq!(count_docs.len(), 1);
|
||||
for (_score, doc_address) in count_docs {
|
||||
let retrieved_doc = searcher.doc::<TantivyDocument>(doc_address)?;
|
||||
|
||||
@@ -28,7 +28,7 @@ fn extract_doc_given_isbn(
|
||||
// The second argument is here to tell we don't care about decoding positions,
|
||||
// or term frequencies.
|
||||
let term_query = TermQuery::new(isbn_term.clone(), IndexRecordOption::Basic);
|
||||
let top_docs = searcher.search(&term_query, &TopDocs::with_limit(1).order_by_score())?;
|
||||
let top_docs = searcher.search(&term_query, &TopDocs::with_limit(1))?;
|
||||
|
||||
if let Some((_score, doc_address)) = top_docs.first() {
|
||||
let doc = searcher.doc(*doc_address)?;
|
||||
|
||||
@@ -1,212 +0,0 @@
|
||||
// # Filter Aggregation Example
|
||||
//
|
||||
// This example demonstrates filter aggregations - creating buckets of documents
|
||||
// matching specific queries, with nested aggregations computed on each bucket.
|
||||
//
|
||||
// Filter aggregations are useful for computing metrics on different subsets of
|
||||
// your data in a single query, like "average price overall + average price for
|
||||
// electronics + count of in-stock items".
|
||||
|
||||
use serde_json::json;
|
||||
use tantivy::aggregation::agg_req::Aggregations;
|
||||
use tantivy::aggregation::AggregationCollector;
|
||||
use tantivy::query::AllQuery;
|
||||
use tantivy::schema::{Schema, FAST, INDEXED, TEXT};
|
||||
use tantivy::{doc, Index};
|
||||
|
||||
fn main() -> tantivy::Result<()> {
    // Create a simple product schema
    let mut schema_builder = Schema::builder();
    schema_builder.add_text_field("category", TEXT | FAST);
    schema_builder.add_text_field("brand", TEXT | FAST);
    schema_builder.add_u64_field("price", FAST);
    schema_builder.add_f64_field("rating", FAST);
    schema_builder.add_bool_field("in_stock", FAST | INDEXED);
    let schema = schema_builder.build();

    // Create index and add sample products
    let index = Index::create_in_ram(schema.clone());
    let mut writer = index.writer(50_000_000)?;

    writer.add_document(doc!(
        schema.get_field("category")? => "electronics",
        schema.get_field("brand")? => "apple",
        schema.get_field("price")? => 999u64,
        schema.get_field("rating")? => 4.5f64,
        schema.get_field("in_stock")? => true
    ))?;
    writer.add_document(doc!(
        schema.get_field("category")? => "electronics",
        schema.get_field("brand")? => "samsung",
        schema.get_field("price")? => 799u64,
        schema.get_field("rating")? => 4.2f64,
        schema.get_field("in_stock")? => true
    ))?;
    writer.add_document(doc!(
        schema.get_field("category")? => "clothing",
        schema.get_field("brand")? => "nike",
        schema.get_field("price")? => 120u64,
        schema.get_field("rating")? => 4.1f64,
        schema.get_field("in_stock")? => false
    ))?;
    writer.add_document(doc!(
        schema.get_field("category")? => "books",
        schema.get_field("brand")? => "penguin",
        schema.get_field("price")? => 25u64,
        schema.get_field("rating")? => 4.8f64,
        schema.get_field("in_stock")? => true
    ))?;

    writer.commit()?;

    let reader = index.reader()?;
    let searcher = reader.searcher();

    // Example 1: Basic filter with metric aggregation
    println!("=== Example 1: Electronics average price ===");
    let agg_req = json!({
        "electronics": {
            "filter": "category:electronics",
            "aggs": {
                "avg_price": { "avg": { "field": "price" } }
            }
        }
    });

    let agg: Aggregations = serde_json::from_value(agg_req)?;
    let collector = AggregationCollector::from_aggs(agg, Default::default());
    let result = searcher.search(&AllQuery, &collector)?;

    let expected = json!({
        "electronics": {
            "doc_count": 2,
            "avg_price": { "value": 899.0 }
        }
    });
    assert_eq!(serde_json::to_value(&result)?, expected);
    println!("{}\n", serde_json::to_string_pretty(&result)?);

    // Example 2: Multiple independent filters
    println!("=== Example 2: Multiple filters in one query ===");
    let agg_req = json!({
        "electronics": {
            "filter": "category:electronics",
            "aggs": { "avg_price": { "avg": { "field": "price" } } }
        },
        "in_stock": {
            "filter": "in_stock:true",
            "aggs": { "count": { "value_count": { "field": "brand" } } }
        },
        "high_rated": {
            "filter": "rating:[4.5 TO *]",
            "aggs": { "count": { "value_count": { "field": "brand" } } }
        }
    });

    let agg: Aggregations = serde_json::from_value(agg_req)?;
    let collector = AggregationCollector::from_aggs(agg, Default::default());
    let result = searcher.search(&AllQuery, &collector)?;

    let expected = json!({
        "electronics": {
            "doc_count": 2,
            "avg_price": { "value": 899.0 }
        },
        "in_stock": {
            "doc_count": 3,
            "count": { "value": 3.0 }
        },
        "high_rated": {
            "doc_count": 2,
            "count": { "value": 2.0 }
        }
    });
    assert_eq!(serde_json::to_value(&result)?, expected);
    println!("{}\n", serde_json::to_string_pretty(&result)?);

    // Example 3: Nested filters - progressive refinement
    println!("=== Example 3: Nested filters ===");
    let agg_req = json!({
        "in_stock": {
            "filter": "in_stock:true",
            "aggs": {
                "electronics": {
                    "filter": "category:electronics",
                    "aggs": {
                        "expensive": {
                            "filter": "price:[800 TO *]",
                            "aggs": {
                                "avg_rating": { "avg": { "field": "rating" } }
                            }
                        }
                    }
                }
            }
        }
    });

    let agg: Aggregations = serde_json::from_value(agg_req)?;
    let collector = AggregationCollector::from_aggs(agg, Default::default());
    let result = searcher.search(&AllQuery, &collector)?;

    let expected = json!({
        "in_stock": {
            "doc_count": 3, // apple, samsung, penguin
            "electronics": {
                "doc_count": 2, // apple, samsung
                "expensive": {
                    "doc_count": 1, // only apple (999)
                    "avg_rating": { "value": 4.5 }
                }
            }
        }
    });
    assert_eq!(serde_json::to_value(&result)?, expected);
    println!("{}\n", serde_json::to_string_pretty(&result)?);

    // Example 4: Filter with sub-aggregation (terms)
    println!("=== Example 4: Filter with terms sub-aggregation ===");
    let agg_req = json!({
        "electronics": {
            "filter": "category:electronics",
            "aggs": {
                "by_brand": {
                    "terms": { "field": "brand" },
                    "aggs": {
                        "avg_price": { "avg": { "field": "price" } }
                    }
                }
            }
        }
    });

    let agg: Aggregations = serde_json::from_value(agg_req)?;
    let collector = AggregationCollector::from_aggs(agg, Default::default());
    let result = searcher.search(&AllQuery, &collector)?;

    let expected = json!({
        "electronics": {
            "doc_count": 2,
            "by_brand": {
                "buckets": [
                    {
                        "key": "samsung",
                        "doc_count": 1,
                        "avg_price": { "value": 799.0 }
                    },
                    {
                        "key": "apple",
                        "doc_count": 1,
                        "avg_price": { "value": 999.0 }
                    }
                ],
                "sum_other_doc_count": 0,
                "doc_count_error_upper_bound": 0
            }
        }
    });
    assert_eq!(serde_json::to_value(&result)?, expected);
    println!("{}", serde_json::to_string_pretty(&result)?);

    Ok(())
}
@@ -85,6 +85,7 @@ fn main() -> tantivy::Result<()> {
    index_writer.add_document(doc!(
        title => "The Diary of a Young Girl",
    ))?;
    index_writer.commit()?;

    // ### Committing
    //
@@ -145,7 +146,7 @@ fn main() -> tantivy::Result<()> {
        let query = FuzzyTermQuery::new(term, 2, true);

        let (top_docs, count) = searcher
            .search(&query, &(TopDocs::with_limit(5).order_by_score(), Count))
            .search(&query, &(TopDocs::with_limit(5), Count))
            .unwrap();
        assert_eq!(count, 3);
        assert_eq!(top_docs.len(), 3);

@@ -69,25 +69,25 @@ fn main() -> tantivy::Result<()> {
    {
        // Inclusive range queries
        let query = query_parser.parse_query("ip:[192.168.0.80 TO 192.168.0.100]")?;
        let count_docs = searcher.search(&*query, &TopDocs::with_limit(5).order_by_score())?;
        let count_docs = searcher.search(&*query, &TopDocs::with_limit(5))?;
        assert_eq!(count_docs.len(), 1);
    }
    {
        // Exclusive range queries
        let query = query_parser.parse_query("ip:{192.168.0.80 TO 192.168.1.100]")?;
        let count_docs = searcher.search(&*query, &TopDocs::with_limit(2).order_by_score())?;
        let count_docs = searcher.search(&*query, &TopDocs::with_limit(2))?;
        assert_eq!(count_docs.len(), 0);
    }
    {
        // Find docs with IP addresses smaller equal 192.168.1.100
        let query = query_parser.parse_query("ip:[* TO 192.168.1.100]")?;
        let count_docs = searcher.search(&*query, &TopDocs::with_limit(2).order_by_score())?;
        let count_docs = searcher.search(&*query, &TopDocs::with_limit(2))?;
        assert_eq!(count_docs.len(), 2);
    }
    {
        // Find docs with IP addresses smaller than 192.168.1.100
        let query = query_parser.parse_query("ip:[* TO 192.168.1.100}")?;
        let count_docs = searcher.search(&*query, &TopDocs::with_limit(2).order_by_score())?;
        let count_docs = searcher.search(&*query, &TopDocs::with_limit(2))?;
        assert_eq!(count_docs.len(), 2);
    }


@@ -59,12 +59,12 @@ fn main() -> tantivy::Result<()> {
    let query_parser = QueryParser::for_index(&index, vec![event_type, attributes]);
    {
        let query = query_parser.parse_query("target:submit-button")?;
        let count_docs = searcher.search(&*query, &TopDocs::with_limit(2).order_by_score())?;
        let count_docs = searcher.search(&*query, &TopDocs::with_limit(2))?;
        assert_eq!(count_docs.len(), 2);
    }
    {
        let query = query_parser.parse_query("target:submit")?;
        let count_docs = searcher.search(&*query, &TopDocs::with_limit(2).order_by_score())?;
        let count_docs = searcher.search(&*query, &TopDocs::with_limit(2))?;
        assert_eq!(count_docs.len(), 2);
    }
    {
@@ -74,33 +74,33 @@ fn main() -> tantivy::Result<()> {
    }
    {
        let query = query_parser.parse_query("click AND cart.product_id:133")?;
        let hits = searcher.search(&*query, &TopDocs::with_limit(2).order_by_score())?;
        let hits = searcher.search(&*query, &TopDocs::with_limit(2))?;
        assert_eq!(hits.len(), 1);
    }
    {
        // The sub-fields in the json field marked as default field still need to be explicitly
        // addressed
        let query = query_parser.parse_query("click AND 133")?;
        let hits = searcher.search(&*query, &TopDocs::with_limit(2).order_by_score())?;
        let hits = searcher.search(&*query, &TopDocs::with_limit(2))?;
        assert_eq!(hits.len(), 0);
    }
    {
        // Default json fields are ignored if they collide with the schema
        let query = query_parser.parse_query("event_type:holiday-sale")?;
        let hits = searcher.search(&*query, &TopDocs::with_limit(2).order_by_score())?;
        let hits = searcher.search(&*query, &TopDocs::with_limit(2))?;
        assert_eq!(hits.len(), 0);
    }
    // # Query via full attribute path
    {
        // This only searches in our schema's `event_type` field
        let query = query_parser.parse_query("event_type:click")?;
        let hits = searcher.search(&*query, &TopDocs::with_limit(2).order_by_score())?;
        let hits = searcher.search(&*query, &TopDocs::with_limit(2))?;
        assert_eq!(hits.len(), 2);
    }
    {
        // Default json fields can still be accessed by full path
        let query = query_parser.parse_query("attributes.event_type:holiday-sale")?;
        let hits = searcher.search(&*query, &TopDocs::with_limit(2).order_by_score())?;
        let hits = searcher.search(&*query, &TopDocs::with_limit(2))?;
        assert_eq!(hits.len(), 1);
    }
    Ok(())

@@ -63,7 +63,7 @@ fn main() -> Result<()> {
    // but not "in the Gulf Stream".
    let query = query_parser.parse_query("\"in the su\"*")?;

    let top_docs = searcher.search(&query, &TopDocs::with_limit(10).order_by_score())?;
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
    let mut titles = top_docs
        .into_iter()
        .map(|(_score, doc_address)| {

@@ -107,8 +107,7 @@ fn main() -> tantivy::Result<()> {
        IndexRecordOption::Basic,
    );

    let (top_docs, count) =
        searcher.search(&query, &(TopDocs::with_limit(2).order_by_score(), Count))?;
    let (top_docs, count) = searcher.search(&query, &(TopDocs::with_limit(2), Count))?;

    assert_eq!(count, 2);

@@ -129,8 +128,7 @@ fn main() -> tantivy::Result<()> {
        IndexRecordOption::Basic,
    );

    let (_top_docs, count) =
        searcher.search(&query, &(TopDocs::with_limit(2).order_by_score(), Count))?;
    let (_top_docs, count) = searcher.search(&query, &(TopDocs::with_limit(2), Count))?;

    assert_eq!(count, 0);


@@ -50,7 +50,7 @@ fn main() -> tantivy::Result<()> {
    let query_parser = QueryParser::for_index(&index, vec![title, body]);
    let query = query_parser.parse_query("sycamore spring")?;

    let top_docs = searcher.search(&query, &TopDocs::with_limit(10).order_by_score())?;
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;

    let snippet_generator = SnippetGenerator::create(&searcher, &*query, body)?;


@@ -102,7 +102,7 @@ fn main() -> tantivy::Result<()> {
    // stop words are applied on the query as well.
    // The following will be equivalent to `title:frankenstein`
    let query = query_parser.parse_query("title:\"the Frankenstein\"")?;
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10).order_by_score())?;
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;

    for (score, doc_address) in top_docs {
        let retrieved_doc: TantivyDocument = searcher.doc(doc_address)?;

@@ -164,7 +164,7 @@ fn main() -> tantivy::Result<()> {
        move |doc_id: DocId| Reverse(price[doc_id as usize])
    };

    let most_expensive_first = TopDocs::with_limit(10).order_by(score_by_price);
    let most_expensive_first = TopDocs::with_limit(10).custom_score(score_by_price);

    let hits = searcher.search(&query, &most_expensive_first)?;
    assert_eq!(

@@ -1,6 +1,6 @@
[package]
name = "tantivy-query-grammar"
version = "0.26.0"
version = "0.25.0"
authors = ["Paul Masurel <paul.masurel@gmail.com>"]
license = "MIT"
categories = ["database-implementations", "data-structures"]
@@ -15,5 +15,3 @@ edition = "2024"
nom = "7"
serde = { version = "1.0.219", features = ["derive"] }
serde_json = "1.0.140"
ordered-float = "5.0.0"
fnv = "1.0.7"

@@ -117,22 +117,6 @@ where F: nom::Parser<I, (O, ErrorList), Infallible> {
    }
}

pub(crate) fn terminated_infallible<I, O1, O2, F, G>(
    mut first: F,
    mut second: G,
) -> impl FnMut(I) -> JResult<I, O1>
where
    F: nom::Parser<I, (O1, ErrorList), Infallible>,
    G: nom::Parser<I, (O2, ErrorList), Infallible>,
{
    move |input: I| {
        let (input, (o1, mut err)) = first.parse(input)?;
        let (input, (_, mut err2)) = second.parse(input)?;
        err.append(&mut err2);
        Ok((input, (o1, err)))
    }
}

pub(crate) fn delimited_infallible<I, O1, O2, O3, F, G, H>(
    mut first: F,
    mut second: G,

@@ -31,17 +31,7 @@ pub fn parse_query_lenient(query: &str) -> (UserInputAst, Vec<LenientError>) {

#[cfg(test)]
mod tests {
    use crate::{UserInputAst, parse_query, parse_query_lenient};

    #[test]
    fn test_deduplication() {
        let ast: UserInputAst = parse_query("a a").unwrap();
        let json = serde_json::to_string(&ast).unwrap();
        assert_eq!(
            json,
            r#"{"type":"bool","clauses":[[null,{"type":"literal","field_name":null,"phrase":"a","delimiter":"none","slop":0,"prefix":false}]]}"#
        );
    }
    use crate::{parse_query, parse_query_lenient};

    #[test]
    fn test_parse_query_serialization() {

@@ -1,7 +1,6 @@
use std::borrow::Cow;
use std::iter::once;

use fnv::FnvHashSet;
use nom::IResult;
use nom::branch::alt;
use nom::bytes::complete::tag;
@@ -69,7 +68,7 @@ fn interpret_escape(source: &str) -> String {

/// Consume a word outside of any context.
// TODO should support escape sequences
fn word(inp: &str) -> IResult<&str, Cow<'_, str>> {
fn word(inp: &str) -> IResult<&str, Cow<str>> {
    map_res(
        recognize(tuple((
            alt((
@@ -306,14 +305,15 @@ fn term_group_infallible(inp: &str) -> JResult<&str, UserInputAst> {
    let (inp, (field_name, _, _, _)) =
        tuple((field_name, multispace0, char('('), multispace0))(inp).expect("precondition failed");

    delimited_infallible(
    let res = delimited_infallible(
        nothing,
        map(ast_infallible, |(mut ast, errors)| {
            ast.set_default_field(field_name.to_string());
            (ast, errors)
        }),
        opt_i_err(char(')'), "expected ')'"),
    )(inp)
    )(inp);
    res
}

fn exists(inp: &str) -> IResult<&str, UserInputLeaf> {
@@ -327,9 +327,7 @@ fn exists(inp: &str) -> IResult<&str, UserInputLeaf> {
            peek(alt((
                value(
                    "",
                    satisfy(|c: char| {
                        c.is_whitespace() || (ESCAPE_IN_WORD.contains(&c) && c != '\\')
                    }),
                    satisfy(|c: char| c.is_whitespace() || ESCAPE_IN_WORD.contains(&c)),
                ),
                eof,
            ))),
@@ -347,9 +345,7 @@ fn exists_precond(inp: &str) -> IResult<&str, (), ()> {
            peek(alt((
                value(
                    "",
                    satisfy(|c: char| {
                        c.is_whitespace() || (ESCAPE_IN_WORD.contains(&c) && c != '\\')
                    }),
                    satisfy(|c: char| c.is_whitespace() || ESCAPE_IN_WORD.contains(&c)),
                ),
                eof,
            ))), // we need to check this isn't a wildcard query
@@ -371,10 +367,7 @@ fn literal(inp: &str) -> IResult<&str, UserInputAst> {
    // something (a field name) got parsed before
    alt((
        map(
            tuple((
                opt(field_name),
                alt((range, set, exists, regex, term_or_phrase)),
            )),
            tuple((opt(field_name), alt((range, set, exists, term_or_phrase)))),
            |(field_name, leaf): (Option<String>, UserInputLeaf)| leaf.set_field(field_name).into(),
        ),
        term_group,
@@ -396,10 +389,6 @@ fn literal_no_group_infallible(inp: &str) -> JResult<&str, Option<UserInputAst>>
                    value((), peek(one_of("{[><"))),
                    map(range_infallible, |(range, errs)| (Some(range), errs)),
                ),
                (
                    value((), peek(one_of("/"))),
                    map(regex_infallible, |(regex, errs)| (Some(regex), errs)),
                ),
            ),
            delimited_infallible(space0_infallible, term_or_phrase_infallible, nothing),
        ),
@@ -564,7 +553,7 @@ fn range_infallible(inp: &str) -> JResult<&str, UserInputLeaf> {
        (
            (
                value((), tag(">=")),
                map(word_infallible(")", false), |(bound, err)| {
                map(word_infallible("", false), |(bound, err)| {
                    (
                        (
                            bound
@@ -578,7 +567,7 @@ fn range_infallible(inp: &str) -> JResult<&str, UserInputLeaf> {
            ),
            (
                value((), tag("<=")),
                map(word_infallible(")", false), |(bound, err)| {
                map(word_infallible("", false), |(bound, err)| {
                    (
                        (
                            UserInputBound::Unbounded,
@@ -592,7 +581,7 @@ fn range_infallible(inp: &str) -> JResult<&str, UserInputLeaf> {
            ),
            (
                value((), tag(">")),
                map(word_infallible(")", false), |(bound, err)| {
                map(word_infallible("", false), |(bound, err)| {
                    (
                        (
                            bound
@@ -606,7 +595,7 @@ fn range_infallible(inp: &str) -> JResult<&str, UserInputLeaf> {
            ),
            (
                value((), tag("<")),
                map(word_infallible(")", false), |(bound, err)| {
                map(word_infallible("", false), |(bound, err)| {
                    (
                        (
                            UserInputBound::Unbounded,
@@ -700,71 +689,6 @@ fn set_infallible(mut inp: &str) -> JResult<&str, UserInputLeaf> {
    }
}

fn regex(inp: &str) -> IResult<&str, UserInputLeaf> {
    map(
        terminated(
            delimited(
                char('/'),
                many1(alt((preceded(char('\\'), char('/')), none_of("/")))),
                char('/'),
            ),
            peek(alt((
                value((), multispace1),
                value((), char(')')),
                value((), char('^')),
                value((), eof),
            ))),
        ),
        |elements| UserInputLeaf::Regex {
            field: None,
            pattern: elements.into_iter().collect::<String>(),
        },
    )(inp)
}

fn regex_infallible(inp: &str) -> JResult<&str, UserInputLeaf> {
    match terminated_infallible(
        delimited_infallible(
            opt_i_err(char('/'), "missing delimiter /"),
            opt_i(many1(alt((preceded(char('\\'), char('/')), none_of("/"))))),
            opt_i_err(char('/'), "missing delimiter /"),
        ),
        opt_i_err(
            peek(alt((
                value((), multispace1),
                value((), char(')')),
                value((), char('^')),
                value((), eof),
            ))),
            "expected whitespace, closing parenthesis, boost, or end of input",
        ),
    )(inp)
    {
        Ok((rest, (elements_part, errors))) => {
            let pattern = match elements_part {
                Some(elements_part) => elements_part.into_iter().collect(),
                None => String::new(),
            };
            let res = UserInputLeaf::Regex {
                field: None,
                pattern,
            };
            Ok((rest, (res, errors)))
        }
        Err(e) => {
            let errs = vec![LenientErrorInternal {
                pos: inp.len(),
                message: e.to_string(),
            }];
            let res = UserInputLeaf::Regex {
                field: None,
                pattern: String::new(),
            };
            Ok((inp, (res, errs)))
        }
    }
}

fn negate(expr: UserInputAst) -> UserInputAst {
    expr.unary(Occur::MustNot)
}
@@ -772,21 +696,7 @@ fn negate(expr: UserInputAst) -> UserInputAst {
fn leaf(inp: &str) -> IResult<&str, UserInputAst> {
    alt((
        delimited(char('('), ast, char(')')),
        map(
            terminated(
                char('*'),
                peek(alt((
                    value((), multispace1),
                    value((), char(')')),
                    value((), eof),
                    value(
                        (),
                        satisfy(|c: char| ESCAPE_IN_WORD.contains(&c) && c != '\\'),
                    ),
                ))),
            ),
            |_| UserInputAst::from(UserInputLeaf::All),
        ),
        map(char('*'), |_| UserInputAst::from(UserInputLeaf::All)),
        map(preceded(tuple((tag("NOT"), multispace1)), leaf), negate),
        literal,
    ))(inp)
@@ -807,21 +717,7 @@ fn leaf_infallible(inp: &str) -> JResult<&str, Option<UserInputAst>> {
                ),
            ),
            (
                value(
                    (),
                    terminated(
                        char('*'),
                        peek(alt((
                            value((), multispace1),
                            value((), char(')')),
                            value((), eof),
                            value(
                                (),
                                satisfy(|c: char| ESCAPE_IN_WORD.contains(&c) && c != '\\'),
                            ),
                        ))),
                    ),
                ),
                value((), char('*')),
                map(nothing, |_| {
                    (Some(UserInputAst::from(UserInputLeaf::All)), Vec::new())
                }),
@@ -857,7 +753,7 @@ fn boosted_leaf(inp: &str) -> IResult<&str, UserInputAst> {
        tuple((leaf, fallible(boost))),
        |(leaf, boost_opt)| match boost_opt {
            Some(boost) if (boost - 1.0).abs() > f64::EPSILON => {
                UserInputAst::Boost(Box::new(leaf), boost.into())
                UserInputAst::Boost(Box::new(leaf), boost)
            }
            _ => leaf,
        },
@@ -869,7 +765,7 @@ fn boosted_leaf_infallible(inp: &str) -> JResult<&str, Option<UserInputAst>> {
        tuple_infallible((leaf_infallible, boost)),
        |((leaf, boost_opt), error)| match boost_opt {
            Some(boost) if (boost - 1.0).abs() > f64::EPSILON => (
                leaf.map(|leaf| UserInputAst::Boost(Box::new(leaf), boost.into())),
                leaf.map(|leaf| UserInputAst::Boost(Box::new(leaf), boost)),
                error,
            ),
            _ => (leaf, error),
@@ -1059,43 +955,18 @@ fn operand_leaf(inp: &str) -> IResult<&str, (Option<BinaryOperand>, Option<Occur
}

fn ast(inp: &str) -> IResult<&str, UserInputAst> {
    // Parse `occur_leaf` once, then conditionally extend into a boolean
    // expression. The previous implementation used `alt((boolean_expr,
    // single_leaf))` which, when the input was a single leaf with no
    // following operand, would parse `occur_leaf` once for `boolean_expr`,
    // fail at `multispace1`, backtrack, then re-parse `occur_leaf` for
    // `single_leaf`. With recursively-nested groups like `(+(+(+a)))`, that
    // doubling at every level produced O(2^n) parse time. Parsing once and
    // peeking ahead for the operand keeps it O(n).
    delimited(
        multispace0,
        |inp| {
            let (rest, first) = occur_leaf(inp)?;
            // Only fall back on `Err::Error` (recoverable), mirroring
            // `alt`'s behaviour. `Err::Failure` and `Err::Incomplete`
            // must propagate so cut points and streaming needs are not
            // accidentally swallowed if they are ever introduced in the
            // operand parsers.
            match preceded(multispace1, many1(operand_leaf))(rest) {
                Ok((rest, more)) => {
                    let combined = aggregate_binary_expressions(first, more)
                        .map_err(|_| nom::Err::Error(Error::new(inp, ErrorKind::MapRes)))?;
                    Ok((rest, combined))
                }
                Err(nom::Err::Error(_)) => {
                    let (occur, ast) = first;
                    let single = if occur == Some(Occur::MustNot) {
                        ast.unary(Occur::MustNot)
                    } else {
                        ast
                    };
                    Ok((rest, single))
                }
                Err(e) => Err(e),
            }
        },
        multispace0,
    )(inp)
    let boolean_expr = map_res(
        separated_pair(occur_leaf, multispace1, many1(operand_leaf)),
        |(left, right)| aggregate_binary_expressions(left, right),
    );
    let single_leaf = map(occur_leaf, |(occur, ast)| {
        if occur == Some(Occur::MustNot) {
            ast.unary(Occur::MustNot)
        } else {
            ast
        }
    });
    delimited(multispace0, alt((boolean_expr, single_leaf)), multispace0)(inp)
}

fn ast_infallible(inp: &str) -> JResult<&str, UserInputAst> {
@@ -1145,25 +1016,12 @@ pub fn parse_to_ast_lenient(query_str: &str) -> (UserInputAst, Vec<LenientError>
    (rewrite_ast(res), errors)
}

/// Removes unnecessary children clauses in AST
///
/// Motivated by [issue #1433](https://github.com/quickwit-oss/tantivy/issues/1433)
fn rewrite_ast(mut input: UserInputAst) -> UserInputAst {
    if let UserInputAst::Clause(sub_clauses) = &mut input {
        // call rewrite_ast recursively on children clauses if applicable
        let mut new_clauses = Vec::with_capacity(sub_clauses.len());
        for (occur, clause) in sub_clauses.drain(..) {
            let rewritten_clause = rewrite_ast(clause);
            new_clauses.push((occur, rewritten_clause));
        }
        *sub_clauses = new_clauses;

        // remove duplicate child clauses
        // e.g. (+a +b) OR (+c +d) OR (+a +b) => (+a +b) OR (+c +d)
        let mut seen = FnvHashSet::default();
        sub_clauses.retain(|term| seen.insert(term.clone()));

        // Removes unnecessary children clauses in AST
        //
        // Motivated by [issue #1433](https://github.com/quickwit-oss/tantivy/issues/1433)
        for term in sub_clauses {
    if let UserInputAst::Clause(terms) = &mut input {
        for term in terms {
            rewrite_ast_clause(term);
        }
    }
@@ -1370,14 +1228,6 @@ mod test {
        test_parse_query_to_ast_helper("<a", "{\"*\" TO \"a\"}");
        test_parse_query_to_ast_helper("<=a", "{\"*\" TO \"a\"]");
        test_parse_query_to_ast_helper("<=bsd", "{\"*\" TO \"bsd\"]");

        test_parse_query_to_ast_helper("(<=42)", "{\"*\" TO \"42\"]");
        test_parse_query_to_ast_helper("(<=42 )", "{\"*\" TO \"42\"]");
        test_parse_query_to_ast_helper("(age:>5)", "\"age\":{\"5\" TO \"*\"}");
        test_parse_query_to_ast_helper(
            "(title:bar AND age:>12)",
            "(+\"title\":bar +\"age\":{\"12\" TO \"*\"})",
        );
    }

    #[test]
@@ -1746,27 +1596,6 @@ mod test {
        test_parse_query_to_ast_helper("abc:a b", "(*\"abc\":a *b)");
        test_parse_query_to_ast_helper("abc:\"a b\"", "\"abc\":\"a b\"");
        test_parse_query_to_ast_helper("foo:[1 TO 5]", "\"foo\":[\"1\" TO \"5\"]");

        // Phrase prefixed with *
        test_parse_query_to_ast_helper("foo:(*A)", "\"foo\":*A");
        test_parse_query_to_ast_helper("*A", "*A");
        test_parse_query_to_ast_helper("(*A)", "*A");
        test_parse_query_to_ast_helper("foo:(A OR B)", "(?\"foo\":A ?\"foo\":B)");
        test_parse_query_to_ast_helper("foo:(A* OR B*)", "(?\"foo\":A* ?\"foo\":B*)");
        test_parse_query_to_ast_helper("foo:(*A OR *B)", "(?\"foo\":*A ?\"foo\":*B)");

        // Regexes between parentheses
        test_parse_query_to_ast_helper("foo:(/A.*/)", "\"foo\":/A.*/");
        test_parse_query_to_ast_helper("foo:(/A.*/ OR /B.*/)", "(?\"foo\":/A.*/ ?\"foo\":/B.*/)");
    }

    #[test]
    fn test_parse_query_all() {
        test_parse_query_to_ast_helper("*", "*");
        test_parse_query_to_ast_helper("(*)", "*");
        test_parse_query_to_ast_helper("(* )", "*");
        // All query with boost
        test_parse_query_to_ast_helper("*^2", "(*)^2");
    }

    #[test]
@@ -1829,7 +1658,6 @@ mod test {
        test_parse_query_to_ast_helper("a:b*", "\"a\":b*");
        test_parse_query_to_ast_helper("a:*b", "\"a\":*b");
        test_parse_query_to_ast_helper(r#"a:*def*"#, "\"a\":*def*");
        test_parse_query_to_ast_helper("a:*\\:foo", "\"a\":*:foo");
    }

    #[test]
@@ -1866,65 +1694,6 @@ mod test {
        test_is_parse_err(r#"!bc:def"#, "!bc:def");
    }

    #[test]
    fn test_regex_parser() {
        let r = parse_to_ast(r#"a:/joh?n(ath[oa]n)/"#);
        assert!(r.is_ok(), "Failed to parse custom query: {r:?}");
        let (_, input) = r.unwrap();
        match input {
            UserInputAst::Leaf(leaf) => match leaf.as_ref() {
                UserInputLeaf::Regex { field, pattern } => {
                    assert_eq!(field, &Some("a".to_string()));
                    assert_eq!(pattern, "joh?n(ath[oa]n)");
                }
                _ => panic!("Expected a regex leaf, got {leaf:?}"),
            },
            _ => panic!("Expected a leaf"),
        }
        let r = parse_to_ast(r#"a:/\\/cgi-bin\\/luci.*/"#);
        assert!(r.is_ok(), "Failed to parse custom query: {r:?}");
        let (_, input) = r.unwrap();
        match input {
            UserInputAst::Leaf(leaf) => match leaf.as_ref() {
                UserInputLeaf::Regex { field, pattern } => {
                    assert_eq!(field, &Some("a".to_string()));
                    assert_eq!(pattern, "\\/cgi-bin\\/luci.*");
                }
                _ => panic!("Expected a regex leaf, got {leaf:?}"),
            },
            _ => panic!("Expected a leaf"),
        }
        // Regex followed by `^boost`
        test_parse_query_to_ast_helper(r#"foo:/bar/^2"#, r#"("foo":/bar/)^2"#);
    }

    #[test]
    fn test_regex_parser_lenient() {
        let literal = |query| literal_infallible(query).unwrap().1;

        let (res, errs) = literal(r#"a:/joh?n(ath[oa]n)/"#);
        let expected = UserInputLeaf::Regex {
            field: Some("a".to_string()),
            pattern: "joh?n(ath[oa]n)".to_string(),
        }
        .into();
        assert_eq!(res.unwrap(), expected);
        assert!(errs.is_empty(), "Expected no errors, got: {errs:?}");

        let (res, errs) = literal("title:/joh?n(ath[oa]n)");
        let expected = UserInputLeaf::Regex {
            field: Some("title".to_string()),
            pattern: "joh?n(ath[oa]n)".to_string(),
        }
        .into();
        assert_eq!(res.unwrap(), expected);
        assert_eq!(errs.len(), 1, "Expected 1 error, got: {errs:?}");
        assert_eq!(
            errs[0].message, "missing delimiter /",
            "Unexpected error message",
        );
    }

    #[test]
    fn test_space_before_value() {
        test_parse_query_to_ast_helper("field : a", r#""field":a"#);
@@ -1935,23 +1704,4 @@ mod test {
|
||||
r#"(+"field":'happy tax payer' +"other_field":1)"#,
|
||||
);
|
||||
}
|
||||
|
||||
// Regression test for https://github.com/quickwit-oss/tantivy/issues/2498:
|
||||
// deeply nested parenthesized queries used to take O(2^n) time because the
|
||||
// top-level `ast()` parser tried `boolean_expr` first and re-parsed the
|
||||
// inner `occur_leaf` when it backtracked to `single_leaf`. Depth 60 would
|
||||
// take ~10^18 operations under the regression; with the fix it parses
|
||||
// instantly. We use `test_parse_query_to_ast_helper` so this test would
|
||||
// never finish if the regression returned.
|
||||
#[test]
|
||||
fn test_parse_deeply_nested_query() {
|
||||
let depth = 60;
|
||||
let leading: String = "(".repeat(depth);
|
||||
let trailing: String = ")".repeat(depth);
|
||||
let query = format!("{leading}title:test{trailing}");
|
||||
test_parse_query_to_ast_helper(&query, r#""title":test"#);
|
||||
|
||||
let query_with_plus = format!("+{leading}title:test{trailing}");
|
||||
test_parse_query_to_ast_helper(&query_with_plus, r#""title":test"#);
|
||||
}
|
||||
}
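The regression-test comment above attributes the O(2^n) behaviour to the parser re-parsing the inner leaf after backtracking. A toy sketch of that effect follows; it is not tantivy's parser. The mini-grammar (`x`, parentheses, and a space-separated pair) and all names (`leaf`, `pair`, `ast`, `count_leaf_calls`) are hypothetical and chosen only for illustration. The `naive` flag switches between "try the pair alternative first, then re-parse from scratch" and "parse the leaf once, then look at what follows".

```rust
// Toy illustration of exponential backtracking. Hypothetical grammar:
//   leaf := 'x' | '(' ast ')'
//   ast  := leaf ' ' leaf | leaf

/// Parses one leaf starting at `pos`; returns the position after it.
/// `calls` counts how many times a leaf parse was attempted.
fn leaf(input: &[u8], pos: usize, calls: &mut u64, naive: bool) -> Option<usize> {
    *calls += 1;
    match *input.get(pos)? {
        b'x' => Some(pos + 1),
        b'(' => {
            let end = ast(input, pos + 1, calls, naive)?;
            if *input.get(end)? == b')' {
                Some(end + 1)
            } else {
                None
            }
        }
        _ => None,
    }
}

/// Parses `leaf ' ' leaf`; fails if the separator or second leaf is missing.
fn pair(input: &[u8], pos: usize, calls: &mut u64, naive: bool) -> Option<usize> {
    let mid = leaf(input, pos, calls, naive)?;
    if *input.get(mid)? != b' ' {
        return None;
    }
    leaf(input, mid + 1, calls, naive)
}

fn ast(input: &[u8], pos: usize, calls: &mut u64, naive: bool) -> Option<usize> {
    if naive {
        // Try the pair alternative first. When it fails, the work done on
        // the first leaf is thrown away and the leaf is parsed again.
        if let Some(end) = pair(input, pos, calls, naive) {
            return Some(end);
        }
        leaf(input, pos, calls, naive)
    } else {
        // Parse the shared prefix once, then decide based on what follows.
        let end = leaf(input, pos, calls, naive)?;
        if input.get(end) == Some(&b' ') {
            leaf(input, end + 1, calls, naive).or(Some(end))
        } else {
            Some(end)
        }
    }
}

/// Builds `depth` nested parentheses around `x` and returns the number of
/// leaf parses needed to consume the whole input.
fn count_leaf_calls(depth: usize, naive: bool) -> u64 {
    let query = format!("{}x{}", "(".repeat(depth), ")".repeat(depth));
    let mut calls = 0;
    let end = ast(query.as_bytes(), 0, &mut calls, naive);
    assert_eq!(end, Some(query.len()));
    calls
}
```

Calling `count_leaf_calls(depth, true)` and `count_leaf_calls(depth, false)` for increasing depths shows the naive count roughly doubling per nesting level while the single-pass count grows by one. That gap is the reason a depth-60 input is a useful regression guard.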
|
||||
|
||||
@@ -5,7 +5,7 @@ use serde::Serialize;
|
||||
|
||||
use crate::Occur;
|
||||
|
||||
#[derive(PartialEq, Eq, Hash, Clone, Serialize)]
|
||||
#[derive(PartialEq, Clone, Serialize)]
|
||||
#[serde(tag = "type")]
|
||||
#[serde(rename_all = "snake_case")]
|
||||
pub enum UserInputLeaf {
|
||||
@@ -23,10 +23,6 @@ pub enum UserInputLeaf {
|
||||
Exists {
|
||||
field: String,
|
||||
},
|
||||
Regex {
|
||||
field: Option<String>,
|
||||
pattern: String,
|
||||
},
|
||||
}
|
||||
|
||||
impl UserInputLeaf {
|
||||
@@ -50,7 +46,6 @@ impl UserInputLeaf {
|
||||
UserInputLeaf::Exists { field: _ } => UserInputLeaf::Exists {
|
||||
field: field.expect("Exist query without a field isn't allowed"),
|
||||
},
|
||||
UserInputLeaf::Regex { field: _, pattern } => UserInputLeaf::Regex { field, pattern },
|
||||
}
|
||||
}
|
||||
|
||||
@@ -66,7 +61,6 @@ impl UserInputLeaf {
|
||||
}
|
||||
UserInputLeaf::Range { field, .. } if field.is_none() => *field = Some(default_field),
|
||||
UserInputLeaf::Set { field, .. } if field.is_none() => *field = Some(default_field),
|
||||
UserInputLeaf::Regex { field, .. } if field.is_none() => *field = Some(default_field),
|
||||
_ => (), // field was already set, do nothing
|
||||
}
|
||||
}
|
||||
@@ -109,19 +103,11 @@ impl Debug for UserInputLeaf {
|
||||
UserInputLeaf::Exists { field } => {
|
||||
write!(formatter, "$exists(\"{field}\")")
|
||||
}
|
||||
UserInputLeaf::Regex { field, pattern } => {
|
||||
if let Some(field) = field {
|
||||
// TODO properly escape field (in case of \")
|
||||
write!(formatter, "\"{field}\":")?;
|
||||
}
|
||||
// TODO properly escape pattern (in case of \")
|
||||
write!(formatter, "/{pattern}/")
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#[derive(Copy, Clone, Eq, PartialEq, Hash, Debug, Serialize)]
|
||||
#[derive(Copy, Clone, Eq, PartialEq, Debug, Serialize)]
|
||||
#[serde(rename_all = "snake_case")]
|
||||
pub enum Delimiter {
|
||||
SingleQuotes,
|
||||
@@ -129,7 +115,7 @@ pub enum Delimiter {
|
||||
None,
|
||||
}
|
||||
|
||||
#[derive(PartialEq, Eq, Hash, Clone, Serialize)]
|
||||
#[derive(PartialEq, Clone, Serialize)]
|
||||
#[serde(rename_all = "snake_case")]
|
||||
pub struct UserInputLiteral {
|
||||
pub field_name: Option<String>,
|
||||
@@ -168,7 +154,7 @@ impl fmt::Debug for UserInputLiteral {
|
||||
}
|
||||
}
|
||||
|
||||
#[derive(PartialEq, Eq, Hash, Debug, Clone, Serialize)]
|
||||
#[derive(PartialEq, Debug, Clone, Serialize)]
|
||||
#[serde(tag = "type", content = "value")]
|
||||
#[serde(rename_all = "snake_case")]
|
||||
pub enum UserInputBound {
|
||||
@@ -205,11 +191,11 @@ impl UserInputBound {
|
||||
}
|
||||
}
|
||||
|
||||
#[derive(PartialEq, Eq, Hash, Clone, Serialize)]
|
||||
#[derive(PartialEq, Clone, Serialize)]
|
||||
#[serde(into = "UserInputAstSerde")]
|
||||
pub enum UserInputAst {
|
||||
Clause(Vec<(Option<Occur>, UserInputAst)>),
|
||||
Boost(Box<UserInputAst>, ordered_float::OrderedFloat<f64>),
|
||||
Boost(Box<UserInputAst>, f64),
|
||||
Leaf(Box<UserInputLeaf>),
|
||||
}
|
||||
|
||||
@@ -231,10 +217,9 @@ impl From<UserInputAst> for UserInputAstSerde {
|
||||
fn from(ast: UserInputAst) -> Self {
|
||||
match ast {
|
||||
UserInputAst::Clause(clause) => UserInputAstSerde::Bool { clauses: clause },
|
||||
UserInputAst::Boost(underlying, boost) => UserInputAstSerde::Boost {
|
||||
underlying,
|
||||
boost: boost.into_inner(),
|
||||
},
|
||||
UserInputAst::Boost(underlying, boost) => {
|
||||
UserInputAstSerde::Boost { underlying, boost }
|
||||
}
|
||||
UserInputAst::Leaf(leaf) => UserInputAstSerde::Leaf(leaf),
|
||||
}
|
||||
}
|
||||
@@ -393,7 +378,7 @@ mod tests {
|
||||
#[test]
|
||||
fn test_boost_serialization() {
|
||||
let inner_ast = UserInputAst::Leaf(Box::new(UserInputLeaf::All));
|
||||
let boost_ast = UserInputAst::Boost(Box::new(inner_ast), 2.5.into());
|
||||
let boost_ast = UserInputAst::Boost(Box::new(inner_ast), 2.5);
|
||||
let json = serde_json::to_string(&boost_ast).unwrap();
|
||||
assert_eq!(
|
||||
json,
|
||||
@@ -420,7 +405,7 @@ mod tests {
|
||||
}))),
|
||||
),
|
||||
])),
|
||||
2.5.into(),
|
||||
2.5,
|
||||
);
|
||||
let json = serde_json::to_string(&boost_ast).unwrap();
|
||||
assert_eq!(
|
||||
|
||||
@@ -20,16 +20,17 @@ Contains all metric aggregations, like average aggregation. Metric aggregations
|
||||
#### agg_req
|
||||
agg_req contains the user's aggregation request. Deserialization from JSON is compatible with Elasticsearch aggregation requests.
|
||||
|
||||
#### agg_data
|
||||
agg_data contains the user's aggregation request enriched with fast field accessors etc., which are
|
||||
#### agg_req_with_accessor
|
||||
agg_req_with_accessor contains the user's aggregation request enriched with fast field accessors etc., which are
|
||||
used during collection.
|
||||
|
||||
#### segment_agg_result
|
||||
segment_agg_result contains the aggregation result tree, which is used for collection of a segment.
|
||||
agg_data is passed during collection.
|
||||
The tree from agg_req_with_accessor is passed during collection.
|
||||
|
||||
#### intermediate_agg_result
|
||||
intermediate_agg_result contains the aggregation tree for merging with other trees.
|
||||
|
||||
#### agg_result
|
||||
agg_result contains the final aggregation tree.
|
||||
|
||||
|
||||
@@ -1,105 +0,0 @@
|
||||
//! This will enhance the request tree with access to the fastfield and metadata.
|
||||
|
||||
use std::io;
|
||||
|
||||
use columnar::{Column, ColumnType};
|
||||
|
||||
use crate::aggregation::{f64_to_fastfield_u64, Key};
|
||||
use crate::index::SegmentReader;
|
||||
|
||||
/// Get the missing value as internal u64 representation
|
||||
///
|
||||
/// For terms we use u64::MAX as sentinel value
|
||||
/// For numerical data we convert the value into the representation
|
||||
/// we would get from the fast field, when we open it as u64_lenient_for_type.
|
||||
///
|
||||
/// That way we can use it the same way as if it would come from the fastfield.
|
||||
pub(crate) fn get_missing_val_as_u64_lenient(
|
||||
column_type: ColumnType,
|
||||
column_max_value: u64,
|
||||
missing: &Key,
|
||||
field_name: &str,
|
||||
) -> crate::Result<Option<u64>> {
|
||||
let missing_val = match missing {
|
||||
Key::Str(_) if column_type == ColumnType::Str => Some(column_max_value + 1),
|
||||
// Allow fallback to number on text fields
|
||||
Key::F64(_) if column_type == ColumnType::Str => Some(column_max_value + 1),
|
||||
Key::U64(_) if column_type == ColumnType::Str => Some(column_max_value + 1),
|
||||
Key::I64(_) if column_type == ColumnType::Str => Some(column_max_value + 1),
|
||||
Key::F64(val) if column_type.numerical_type().is_some() => {
|
||||
f64_to_fastfield_u64(*val, &column_type)
|
||||
}
|
||||
// NOTE: We may lose precision of the passed missing value by casting i64 and u64 to f64.
|
||||
Key::I64(val) if column_type.numerical_type().is_some() => {
|
||||
f64_to_fastfield_u64(*val as f64, &column_type)
|
||||
}
|
||||
Key::U64(val) if column_type.numerical_type().is_some() => {
|
||||
f64_to_fastfield_u64(*val as f64, &column_type)
|
||||
}
|
||||
_ => {
|
||||
return Err(crate::TantivyError::InvalidArgument(format!(
|
||||
"Missing value {missing:?} for field {field_name} is not supported for column \
|
||||
type {column_type:?}"
|
||||
)));
|
||||
}
|
||||
};
|
||||
Ok(missing_val)
|
||||
}
|
||||
|
||||
pub(crate) fn get_numeric_or_date_column_types() -> &'static [ColumnType] {
|
||||
&[
|
||||
ColumnType::F64,
|
||||
ColumnType::U64,
|
||||
ColumnType::I64,
|
||||
ColumnType::DateTime,
|
||||
]
|
||||
}
|
||||
|
||||
/// Get fast field reader or empty as default.
|
||||
pub(crate) fn get_ff_reader(
|
||||
reader: &SegmentReader,
|
||||
field_name: &str,
|
||||
allowed_column_types: Option<&[ColumnType]>,
|
||||
) -> crate::Result<(columnar::Column<u64>, ColumnType)> {
|
||||
let ff_fields = reader.fast_fields();
|
||||
let ff_field_with_type = ff_fields
|
||||
.u64_lenient_for_type(allowed_column_types, field_name)?
|
||||
.unwrap_or_else(|| {
|
||||
(
|
||||
Column::build_empty_column(reader.num_docs()),
|
||||
ColumnType::U64,
|
||||
)
|
||||
});
|
||||
Ok(ff_field_with_type)
|
||||
}
|
||||
|
||||
pub(crate) fn get_dynamic_columns(
|
||||
reader: &SegmentReader,
|
||||
field_name: &str,
|
||||
) -> crate::Result<Vec<columnar::DynamicColumn>> {
|
||||
let ff_fields = reader.fast_fields().dynamic_column_handles(field_name)?;
|
||||
let cols = ff_fields
|
||||
.iter()
|
||||
.map(|h| h.open())
|
||||
.collect::<io::Result<_>>()?;
|
||||
assert!(!ff_fields.is_empty(), "field {field_name} not found");
|
||||
Ok(cols)
|
||||
}
|
||||
|
||||
/// Get all fast field reader or empty as default.
|
||||
///
|
||||
/// Is guaranteed to return at least one column.
|
||||
pub(crate) fn get_all_ff_reader_or_empty(
|
||||
reader: &SegmentReader,
|
||||
field_name: &str,
|
||||
allowed_column_types: Option<&[ColumnType]>,
|
||||
fallback_type: ColumnType,
|
||||
) -> crate::Result<Vec<(columnar::Column<u64>, ColumnType)>> {
|
||||
let ff_fields = reader.fast_fields();
|
||||
let mut ff_field_with_type =
|
||||
ff_fields.u64_lenient_for_type_all(allowed_column_types, field_name)?;
|
||||
if ff_field_with_type.is_empty() {
|
||||
ff_field_with_type.push((Column::build_empty_column(reader.num_docs()), fallback_type));
|
||||
}
|
||||
Ok(ff_field_with_type)
|
||||
}
|
||||
File diff suppressed because it is too large
Some files were not shown because too many files have changed in this diff