neon/test_runner/performance/pgvector
Peter Bendel 46210035c5 add halfvec indexing and queries to periodic pgvector performance tests (#8057)
## Problem

halfvec data type was introduced in pgvector 0.7.0 and is popular because it allows smaller vectors, smaller indexes and potentially better performance.

So far we have not tested halfvec in our periodic performance tests.
This PR adds halfvec indexing and halfvec queries to the test.
2024-06-14 18:36:50 +02:00

Source of the dataset for pgvector tests

This readme was copied from https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M

Download the parquet files

brew install git-lfs
git lfs install
git clone https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-1M

Load into postgres:

see loaddata.py in this directory
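
The loading step can be sketched roughly as follows. This is not the contents of loaddata.py; it is a minimal illustration of turning one dataset row into parameters for a pgvector insert. The table name `documents`, the column name `embeddings`, and the helper names are assumptions made for this example. Only the feature names (`_id`, `title`, `text`, `text-embedding-3-large-1536-embedding`) come from the dataset card.

```python
from typing import Mapping, Sequence, Tuple

# Assumed names for illustration only; adjust to match loaddata.py.
TABLE = "documents"
DIMENSIONS = 1536
EMBEDDING_FEATURE = "text-embedding-3-large-1536-embedding"

# DDL for a table that can hold the dataset (assumed schema).
CREATE_TABLE_SQL = f"""
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS {TABLE} (
    _id text PRIMARY KEY,
    title text,
    text text,
    embeddings vector({DIMENSIONS})
);
"""

# Parameterized insert; the embedding is passed as text and cast to vector.
INSERT_SQL = (
    f"INSERT INTO {TABLE} (_id, title, text, embeddings) "
    "VALUES (%s, %s, %s, %s::vector)"
)


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Render floats in pgvector's text input format, e.g. '[1.0,0.5]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def to_params(row: Mapping[str, object]) -> Tuple[str, str, str, str]:
    """Convert one dataset row into the parameter tuple for INSERT_SQL."""
    embedding = row[EMBEDDING_FEATURE]
    if len(embedding) != DIMENSIONS:
        raise ValueError(f"expected {DIMENSIONS} dimensions, got {len(embedding)}")
    return (row["_id"], row["title"], row["text"], to_vector_literal(embedding))
```

With a Postgres driver such as psycopg, each tuple returned by `to_params` could be passed alongside `INSERT_SQL` to `cursor.execute`; batching the rows is advisable for 1M entries.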

Rest of dataset card as on huggingface


dataset_info:
  features:
    - name: _id
      dtype: string
    - name: title
      dtype: string
    - name: text
      dtype: string
    - name: text-embedding-3-large-1536-embedding
      sequence: float64
  splits:
    - name: train
      num_bytes: 12679725776
      num_examples: 1000000
  download_size: 9551862565
  dataset_size: 12679725776
configs:
  - config_name: default
    data_files:
      - split: train
        path: data/train-*
license: mit
task_categories:
  - feature-extraction
language:
  - en
size_categories:
  - 1M<n<10M

1M OpenAI Embeddings: text-embedding-3-large 1536 dimensions

  • Created: February 2024.
  • Text used for Embedding: title (string) + text (string)
  • Embedding Model: OpenAI text-embedding-3-large
  • This dataset was generated from the first 1M entries of https://huggingface.co/datasets/BeIR/dbpedia-entity, extracted by @KShivendu_
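
To put the dataset size in context for the `vector` and `halfvec` types used in the tests, the following sketch estimates raw embedding storage for the 1M rows. It relies on the per-value sizes given in the pgvector documentation (4 bytes per dimension plus 8 bytes for `vector`, 2 bytes per dimension plus 8 bytes for `halfvec`). It ignores row headers, TOAST, and index overhead, so treat the numbers as a lower bound rather than a measurement.

```python
NUM_ROWS = 1_000_000   # num_examples from the dataset card
DIMENSIONS = 1536      # embedding dimensions


def vector_bytes(dims: int) -> int:
    """Storage for one pgvector `vector` value: 4 bytes per dimension + 8."""
    return 4 * dims + 8


def halfvec_bytes(dims: int) -> int:
    """Storage for one pgvector `halfvec` value: 2 bytes per dimension + 8."""
    return 2 * dims + 8


vector_total = NUM_ROWS * vector_bytes(DIMENSIONS)
halfvec_total = NUM_ROWS * halfvec_bytes(DIMENSIONS)

if __name__ == "__main__":
    print(f"vector:  {vector_total / 1e9:.2f} GB")
    print(f"halfvec: {halfvec_total / 1e9:.2f} GB")
```

Under these assumptions the halfvec column needs about half the space of the vector column, which is the motivation for also testing halfvec indexes and queries.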