From 114866fbcfd3689f0b22890bb338fd049b08ab41 Mon Sep 17 00:00:00 2001
From: QianZhu
Date: Wed, 20 Nov 2024 17:51:11 -0800
Subject: [PATCH] docs: OSS doc improvement (#1859)

OSS doc improvement - HNSW index parameter explanation and others.

---------

Co-authored-by: BubbleCal
---
 docs/src/ann_indexes.md          | 14 +++++++++++---
 docs/src/concepts/index_hnsw.md  |  7 +++++++
 docs/src/concepts/index_ivfpq.md |  4 +++-
 docs/src/reranking/cohere.md     |  3 +++
 4 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/docs/src/ann_indexes.md b/docs/src/ann_indexes.md
index 5b3bac30..2d80c48e 100644
--- a/docs/src/ann_indexes.md
+++ b/docs/src/ann_indexes.md
@@ -277,7 +277,15 @@ Product quantization can lead to approximately `16 * sizeof(float32) / 1 = 64` t
 Higher number of partitions could lead to more efficient I/O during queries and better accuracy, but it takes much more time to train.
 On `SIFT-1M` dataset, our benchmark shows that keeping each partition 1K-4K rows lead to a good latency / recall.
 
-`num_sub_vectors` specifies how many Product Quantization (PQ) short codes to generate on each vector. Because
+`num_sub_vectors` specifies how many Product Quantization (PQ) short codes to generate on each vector. The number should be a factor of the vector dimension. Because
 PQ is a lossy compression of the original vector, a higher `num_sub_vectors` usually results in
-less space distortion, and thus yields better accuracy. However, a higher `num_sub_vectors` also causes heavier I/O and
-more PQ computation, and thus, higher latency. `dimension / num_sub_vectors` should be a multiple of 8 for optimum SIMD efficiency.
+less space distortion, and thus yields better accuracy. However, a higher `num_sub_vectors` also causes heavier I/O and more PQ computation, and thus, higher latency. `dimension / num_sub_vectors` should be a multiple of 8 for optimum SIMD efficiency.
+
+!!! note
+    If `num_sub_vectors` is set to be greater than the vector dimension, you will see errors like `attempt to divide by zero`.
+
+### How to choose `m` and `ef_construction` for `IVF_HNSW_*` index?
+
+`m` determines the number of connections a new node establishes with its closest neighbors upon entering the graph. Typically, `m` falls within the range of 5 to 48. Lower `m` values are suitable for low-dimensional data or scenarios where recall is less critical. Conversely, higher `m` values are beneficial for high-dimensional data or when high recall is required. In essence, a larger `m` results in a denser graph with increased connectivity, but at the expense of higher memory consumption.
+
+`ef_construction` balances build speed and accuracy. Higher values increase accuracy but slow down the build process. A typical range is 150 to 300. For good search results, a minimum value of 100 is recommended. In most cases, setting this value above 500 offers no additional benefit. Ensure that `ef_construction` is always set to a value equal to or greater than `ef` in the search phase.
\ No newline at end of file
diff --git a/docs/src/concepts/index_hnsw.md b/docs/src/concepts/index_hnsw.md
index 8bfaf39c..0e9930b9 100644
--- a/docs/src/concepts/index_hnsw.md
+++ b/docs/src/concepts/index_hnsw.md
@@ -57,6 +57,13 @@ Then the greedy search routine operates as follows:
 
 ## Usage
 
+There are three key parameters to set when constructing an HNSW index:
+
+* `metric`: Use an `L2` Euclidean distance metric. We also support `dot` and `cosine` distance.
+* `m`: The number of neighbors to select for each vector in the HNSW graph.
+* `ef_construction`: The number of candidates to evaluate during the construction of the HNSW graph.
+
+
 We can combine the above concepts to understand how to build and query an HNSW index in LanceDB.
 
 ### Construct index
diff --git a/docs/src/concepts/index_ivfpq.md b/docs/src/concepts/index_ivfpq.md
index cf522557..7220d2c8 100644
--- a/docs/src/concepts/index_ivfpq.md
+++ b/docs/src/concepts/index_ivfpq.md
@@ -58,8 +58,10 @@ In Python, the index can be created as follows:
 # Make sure you have enough data in the table for an effective training step
 tbl.create_index(metric="L2", num_partitions=256, num_sub_vectors=96)
 ```
+!!! note
+    `num_partitions`=256 and `num_sub_vectors`=96 do not work for every dataset. Those values need to be adjusted for your particular dataset.
 
-The `num_partitions` is usually chosen to target a particular number of vectors per partition. `num_sub_vectors` is typically chosen based on the desired recall and the dimensionality of the vector. See the [FAQs](#faq) below for best practices on choosing these parameters.
+The `num_partitions` is usually chosen to target a particular number of vectors per partition. `num_sub_vectors` is typically chosen based on the desired recall and the dimensionality of the vector. See [here](../ann_indexes.md/#how-to-choose-num_partitions-and-num_sub_vectors-for-ivf_pq-index) for best practices on choosing these parameters.
 
 ### Query the index
 
diff --git a/docs/src/reranking/cohere.md b/docs/src/reranking/cohere.md
index 50b72e56..f6e54499 100644
--- a/docs/src/reranking/cohere.md
+++ b/docs/src/reranking/cohere.md
@@ -6,6 +6,9 @@ This re-ranker uses the [Cohere](https://cohere.ai/) API to rerank the search re
 
 !!! note
     Supported Query Types: Hybrid, Vector, FTS
 
+```shell
+pip install cohere
+```
 ```python
 import numpy