neon/pgvector at fabeff822fac24b7ba45214e907295003874252a - neon

mirror of https://github.com/neondatabase/neon.git synced 2026-01-04 12:02:55 +00:00

Peter Bendel fabeff822f Performance test for pgvector HNSW index build and queries (#7873)

## Problem

We want to regularly verify the performance of pgvector HNSW parallel
index builds and parallel similarity search using HNSW indexes.
pgvector 0.7.0 was the first release that considerably improved
index-build parallelism, and we want to make sure that our Neon compute
VM settings (swap, memory overcommit, Postgres configuration, etc.) do
not cause a regression.

## Summary of changes

Prepare a Neon project with 1 million OpenAI vector embeddings (1536
dimensions each).
Run HNSW index builds in the regression test for the various distance
metrics.
Run similarity queries using pgbench with 100 concurrent clients.
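
The index-build and query steps can be sketched roughly as follows. This is a minimal illustration, not the contents of `HNSW_build.sql` or the pgbench scripts: the table name `documents`, the column name `embedding`, the index names, the worker count, and the run duration are all hypothetical. The operator classes (`vector_l2_ops`, `vector_ip_ops`, `vector_cosine_ops`) are pgvector's, and `max_parallel_maintenance_workers` is the standard Postgres setting that bounds parallel index builds.

```python
import shlex

# pgvector operator classes for three common distance metrics.
OPCLASSES = {
    "l2": "vector_l2_ops",
    "inner_product": "vector_ip_ops",
    "cosine": "vector_cosine_ops",
}


def hnsw_build_statements(table="documents", column="embedding", parallel_workers=7):
    """Return SQL that builds one HNSW index per distance metric.

    Table, column, and worker count are hypothetical placeholders.
    """
    stmts = [f"SET max_parallel_maintenance_workers = {parallel_workers};"]
    for metric, opclass in OPCLASSES.items():
        index_name = f"{table}_{column}_hnsw_{metric}"
        stmts.append(
            f"CREATE INDEX {index_name} ON {table} USING hnsw ({column} {opclass});"
        )
    return stmts


def pgbench_command(script, clients=100, seconds=60):
    """Return a pgbench argv that runs a custom script with concurrent clients.

    -n skips the vacuum step that only applies to pgbench's built-in tables.
    """
    return ["pgbench", "-n", "-c", str(clients), "-T", str(seconds), "-f", script]


if __name__ == "__main__":
    for stmt in hnsw_build_statements():
        print(stmt)
    print(shlex.join(pgbench_command("pgbench_hnsw_queries.sql")))
```

Generating the statements from a small mapping keeps the per-metric builds symmetric, so timing differences reflect the metric rather than differences in how each index was declared.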

I have also added the relevant metrics to the Grafana dashboards pgbench
and olape.

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>

2024-05-28 11:05:33 +00:00

| File | Last commit | Date |
|---|---|---|
| HNSW_build.sql | Performance test for pgvector HNSW index build and queries (#7873) | 2024-05-28 11:05:33 +00:00 |
| IVFFLAT_build.sql | Performance test for pgvector HNSW index build and queries (#7873) | 2024-05-28 11:05:33 +00:00 |
| loaddata.py | Performance test for pgvector HNSW index build and queries (#7873) | 2024-05-28 11:05:33 +00:00 |
| pgbench_custom_script_pgvector_hsnw_queries.sql | Performance test for pgvector HNSW index build and queries (#7873) | 2024-05-28 11:05:33 +00:00 |
| pgbench_hnsw_queries.sql | Performance test for pgvector HNSW index build and queries (#7873) | 2024-05-28 11:05:33 +00:00 |
| README.md | Performance test for pgvector HNSW index build and queries (#7873) | 2024-05-28 11:05:33 +00:00 |

README.md

Dataset metadata (dataset_info, configs, license, task_categories, language, size_categories)

dataset_info, features:

| name | dtype |
|---|---|
| _id | string |
| title | string |
| text | string |
| text-embedding-3-large-1536-embedding | sequence of float64 |

dataset_info, splits:

| name | num_bytes | num_examples |
|---|---|---|
| train | 12679725776 | 1000000 |

Remaining fields:

| field | value |
|---|---|
| download_size | 9551862565 |
| dataset_size | 12679725776 |
| config_name | default |
| data_files | split: train, path: data/train-* |
| license | mit |
| task_categories | feature-extraction |
| language | |
| size_categories | 1M<n<10M |
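
As a quick sanity check on the sizes in the metadata, the embedding column alone accounts for most of the reported `num_bytes`. The sketch below assumes each float64 value occupies exactly 8 bytes and ignores any container overhead such as list offsets, so the "other" figure is only an approximation of what the `_id`, `title`, and `text` columns and any overhead take up.

```python
NUM_EXAMPLES = 1_000_000               # num_examples for the train split
DIMENSIONS = 1536                      # embedding length
BYTES_PER_FLOAT64 = 8                  # assumption: raw payload only, no overhead
REPORTED_NUM_BYTES = 12_679_725_776    # num_bytes for the train split

# Raw size of the embedding column under the assumption above.
embedding_bytes = NUM_EXAMPLES * DIMENSIONS * BYTES_PER_FLOAT64

# Whatever is left is attributable to the string columns plus overhead.
other_bytes = REPORTED_NUM_BYTES - embedding_bytes

if __name__ == "__main__":
    print(f"embedding payload: {embedding_bytes:,} bytes")
    print(f"everything else:   {other_bytes:,} bytes")
    print(f"embedding share:   {embedding_bytes / REPORTED_NUM_BYTES:.1%}")
    print(f"other per row:     {other_bytes / NUM_EXAMPLES:.1f} bytes")
```

Under that assumption the embeddings take 12,288,000,000 bytes, leaving 391,725,776 bytes for everything else.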

1M OpenAI Embeddings: text-embedding-3-large 1536 dimensions

- Created: February 2024.
- Text used for embedding: title (string) + text (string)
- Embedding model: OpenAI text-embedding-3-large
- This dataset was generated from the first 1M entries of https://huggingface.co/datasets/BeIR/dbpedia-entity, extracted by @KShivendu_ here
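
To show how one record of this dataset could be prepared for a pgvector column, here is a minimal sketch. It is not the logic of `loaddata.py`: the target table `documents`, its `embedding` column, and the `%s` placeholder style (as used by psycopg) are assumptions. The field names come from the dataset's feature list, and the bracketed, comma-separated string is pgvector's text input format for the `vector` type.

```python
EMBEDDING_FIELD = "text-embedding-3-large-1536-embedding"

# Hypothetical target table; %s placeholders assume a psycopg-style driver.
INSERT_SQL = (
    "INSERT INTO documents (_id, title, text, embedding) "
    "VALUES (%s, %s, %s, %s::vector)"
)


def to_vector_literal(values):
    """Format a sequence of numbers as a pgvector text literal, e.g. [0.5,1.0]."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def to_row(record):
    """Map one dataset record (a dict) to the parameter tuple for INSERT_SQL."""
    return (
        record["_id"],
        record["title"],
        record["text"],
        to_vector_literal(record[EMBEDDING_FIELD]),
    )


if __name__ == "__main__":
    # Tiny made-up record; real embeddings have 1536 values.
    sample = {
        "_id": "example-id",
        "title": "Example title",
        "text": "Example text",
        EMBEDDING_FIELD: [0.5, 1, -0.25],
    }
    print(to_row(sample))
```

Passing the embedding as text and casting it with `::vector` avoids depending on a driver-specific vector adapter, at the cost of some parsing work on the server.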