feat!: add variable store to embeddings registry (#2112)

BREAKING CHANGE: embedding function implementations in Node now need to
call `resolveVariables()` in their constructors and should **not**
implement `toJSON()`.

This tries to address the handling of secrets. In Node, they are
currently lost. In Python, they are currently leaked into the table
schema metadata.

This PR introduces an in-memory variable store on the function registry.
It also allows embedding function definitions to label certain config
values as "sensitive", and the preprocessing logic will raise an error
if users try to pass in hard-coded values.

Closes #2110
Closes #521

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>
Will Jones
2025-02-24 15:52:19 -08:00
committed by GitHub
parent ecdee4d2b1
commit 7ac5f74c80
24 changed files with 699 additions and 175 deletions


@@ -55,6 +55,14 @@ Let's implement `SentenceTransformerEmbeddings` class. All you need to do is imp
This is a stripped down version of our implementation of `SentenceTransformerEmbeddings` that removes certain optimizations and default settings.
!!! danger "Use sensitive keys to prevent leaking secrets"
    To prevent leaking secrets, such as API keys, you should add any sensitive
    parameters of an embedding function to the output of the
    [sensitive_keys()][lancedb.embeddings.base.EmbeddingFunction.sensitive_keys] /
    [getSensitiveKeys()](../../js/namespaces/embedding/classes/EmbeddingFunction/#getsensitivekeys)
    method. This prevents users from accidentally instantiating the embedding
    function with hard-coded secrets.
Now you can use this embedding function to create your table schema, and that's it! You can then ingest data and run queries without manually vectorizing the inputs.
=== "Python"


@@ -0,0 +1,53 @@
# Variables and Secrets
Most embedding configuration options are saved in the table's metadata. However,
this isn't always appropriate. For example, API keys should never be stored in the
metadata. Additionally, other configuration options might be best set at runtime,
such as the `device` configuration that controls whether to use GPU or CPU for
inference. If you hardcoded this to GPU, you wouldn't be able to run the code on
a server without one.
To handle these cases, you can set variables on the embedding registry and
reference them in the embedding configuration. These variables will be available
during the runtime of your program, but not saved in the table's metadata. When
the table is loaded from a different process, the variables must be set again.
To set a variable, use the `set_var()` / `setVar()` method on the embedding registry.
To reference a variable, use the syntax `$var:VARIABLE_NAME`. If there is a default
value, you can use the syntax `$var:VARIABLE_NAME:DEFAULT_VALUE`.
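As a rough illustration of the reference syntax, the sketch below resolves a single configuration value against a map of variables, using the `$var:` prefix documented for `setVar()`. The `resolveReference` helper and the sample variable names are hypothetical and are not LanceDB's implementation. They only show the intended behavior: a set variable wins, a default is used otherwise, and an unset variable without a default is an error.

```typescript
// Hypothetical helper (an assumption, not LanceDB's code): resolves one config
// value that may be written as `$var:NAME` or `$var:NAME:DEFAULT`.
const VAR_PREFIX = "$var:";

function resolveReference(value: string, vars: Map<string, string>): string {
  if (!value.startsWith(VAR_PREFIX)) {
    return value; // plain literal, nothing to resolve
  }
  const rest = value.slice(VAR_PREFIX.length);
  // Split on the first colon only: names cannot contain colons,
  // but default values can.
  const sep = rest.indexOf(":");
  const name = sep === -1 ? rest : rest.slice(0, sep);
  const fallback = sep === -1 ? undefined : rest.slice(sep + 1);

  const found = vars.get(name);
  if (found !== undefined) return found;
  if (fallback !== undefined) return fallback;
  throw new Error(`Variable "${name}" is not set and has no default`);
}

// Sample variables, made up for this example.
const sampleVars = new Map<string, string>([["api_key", "sk-example"]]);
console.log(resolveReference("$var:api_key", sampleVars)); // sk-example
console.log(resolveReference("$var:device:cpu", sampleVars)); // cpu
console.log(resolveReference("$var:endpoint:http://localhost:8080", sampleVars)); // http://localhost:8080
```

Splitting on the first colon only is what lets a default value such as a URL contain colons of its own.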
## Using variables to set secrets
Sensitive configuration, such as API keys, must be set either as environment
variables or as variables on the embedding registry. If you pass in a hardcoded
value, LanceDB will raise an error. If you want to set an API key via
configuration, use a variable instead:
=== "Python"

    ```python
    --8<-- "python/python/tests/docs/test_embeddings_optional.py:register_secret"
    ```

=== "Typescript"

    ```typescript
    --8<-- "nodejs/examples/embedding.test.ts:register_secret"
    ```
## Using variables to set the device parameter
Many embedding functions that run locally have a `device` parameter that controls
whether to use GPU or CPU for inference. Because not all computers have a GPU,
it's helpful to be able to set the `device` parameter at runtime, rather than
have it hard coded in the embedding configuration. To make it work even if the
variable isn't set, you could provide a default value of `cpu` in the embedding
configuration.
Some embedding libraries even have a method to detect which devices are available,
which could be used to dynamically set the device at runtime. For example, in Python
you can check if a CUDA GPU is available using `torch.cuda.is_available()`.
```python
--8<-- "python/python/tests/docs/test_embeddings_optional.py:register_device"
```


@@ -8,6 +8,23 @@
An embedding function that automatically creates vector representation for a given column.
It's important subclasses pass the **original** options to the super constructor
and then pass those options to `resolveVariables` to resolve any variables before
using them.
## Example
```ts
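class MyEmbeddingFunction extends EmbeddingFunction {
  constructor(optionsRaw: {model: string, timeout: number}) {
    // Pass the original (unresolved) options to the super constructor.
    super(optionsRaw);
    // Resolve any variable references before using the values.
    const options = this.resolveVariables(optionsRaw);
    this.model = options.model;
    this.timeout = options.timeout;
  }
}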
```
## Extended by
- [`TextEmbeddingFunction`](TextEmbeddingFunction.md)
@@ -82,12 +99,33 @@ The datatype of the embeddings
***
### getSensitiveKeys()
```ts
protected getSensitiveKeys(): string[]
```
Provide a list of keys in the function options that should be treated as
sensitive. If users pass raw values for these keys, they will be rejected.
#### Returns
`string`[]
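
To make the "raw values are rejected" rule concrete, here is a minimal, self-contained sketch of the kind of check the description implies. The `assertNoRawSecrets` function, the `api_key` key, and the error wording are all hypothetical. The sketch also assumes that a value counts as safe only when it is a `$var:` reference; the library's actual preprocessing may differ.

```typescript
// Hypothetical check (an assumption, not LanceDB's code): reject hard-coded
// values for any key that the embedding function lists as sensitive.
function assertNoRawSecrets(
  config: Record<string, unknown>,
  sensitiveKeys: string[],
): void {
  for (const key of sensitiveKeys) {
    const value = config[key];
    if (value === undefined) continue; // unset keys are fine
    const isReference =
      typeof value === "string" && value.startsWith("$var:");
    if (!isReference) {
      throw new Error(
        `The key "${key}" is sensitive; pass it as a $var: reference`,
      );
    }
  }
}

// A variable reference is accepted.
assertNoRawSecrets({ model: "small", api_key: "$var:api_key" }, ["api_key"]);

// A hard-coded secret is rejected.
try {
  assertNoRawSecrets({ model: "small", api_key: "sk-hardcoded" }, ["api_key"]);
} catch (err) {
  console.log((err as Error).message);
}
```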
***
### init()?
```ts
optional init(): Promise<void>
```
Optionally load any resources needed for the embedding function.
This method is called after the embedding function has been initialized
but before any embeddings are computed. It is useful for loading local models
or other resources that are needed for the embedding function to work.
#### Returns
`Promise`&lt;`void`&gt;
@@ -108,6 +146,24 @@ The number of dimensions of the embeddings
***
### resolveVariables()
```ts
protected resolveVariables(config): Partial<M>
```
Apply variables to the config.
#### Parameters
* **config**: `Partial`&lt;`M`&gt;
#### Returns
`Partial`&lt;`M`&gt;
***
### sourceField()
```ts
@@ -134,37 +190,15 @@ sourceField is used in combination with `LanceSchema` to provide a declarative d
### toJSON()
```ts
abstract toJSON(): Partial<M>
toJSON(): Record<string, any>
```
Convert the embedding function to a JSON object
It is used to serialize the embedding function to the schema
It's important that any object returned by this method contains all the necessary
information to recreate the embedding function
It should return the same object that was passed to the constructor
If it does not, the embedding function will not be able to be recreated, or could be recreated incorrectly
Get the original arguments to the constructor, to serialize them so they
can be used to recreate the embedding function later.
#### Returns
`Partial`&lt;`M`&gt;
#### Example
```ts
class MyEmbeddingFunction extends EmbeddingFunction {
constructor(options: {model: string, timeout: number}) {
super();
this.model = options.model;
this.timeout = options.timeout;
}
toJSON() {
return {
model: this.model,
timeout: this.timeout,
};
}
```
`Record`&lt;`string`, `any`&gt;
***


@@ -80,6 +80,28 @@ getTableMetadata(functions): Map<string, string>
***
### getVar()
```ts
getVar(name): undefined | string
```
Get a variable.
#### Parameters
* **name**: `string`
#### Returns
`undefined` \| `string`
#### See
[setVar](EmbeddingFunctionRegistry.md#setvar)
***
### length()
```ts
@@ -145,3 +167,31 @@ reset the registry to the initial state
#### Returns
`void`
***
### setVar()
```ts
setVar(name, value): void
```
Set a variable. These can be accessed in the embedding function
configuration using the syntax `$var:variable_name`. If they are not
set, an error will be thrown letting you know which key is unset. If you
want to supply a default value, you can add an additional part in the
configuration like so: `$var:variable_name:default_value`. Default values
can be used for runtime configurations that are not sensitive, such as
whether to use a GPU for inference.
The name must not contain colons. The default value can contain colons.
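
The behavior described above can be pictured with a small in-memory store. The `VariableStore` class below is a hypothetical stand-in for the registry, not its real implementation. In particular, throwing when a name contains a colon is an assumption about how that rule might be enforced.

```typescript
// Hypothetical in-memory store (an assumption) mirroring the setVar()/getVar()
// signatures documented here. Values live only for the lifetime of the
// process and are never serialized.
class VariableStore {
  private vars = new Map<string, string>();

  setVar(name: string, value: string): void {
    // Names are split from defaults on ":", so a colon in a name
    // would make references ambiguous.
    if (name.indexOf(":") !== -1) {
      throw new Error(`Variable name "${name}" must not contain colons`);
    }
    this.vars.set(name, value);
  }

  getVar(name: string): string | undefined {
    return this.vars.get(name);
  }
}

const store = new VariableStore();
store.setVar("device", "cuda");
console.log(store.getVar("device")); // cuda
console.log(store.getVar("unset")); // undefined
```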
#### Parameters
* **name**: `string`
* **value**: `string`
#### Returns
`void`


@@ -114,12 +114,37 @@ abstract generateEmbeddings(texts, ...args): Promise<number[][] | Float32Array[]
***
### getSensitiveKeys()
```ts
protected getSensitiveKeys(): string[]
```
Provide a list of keys in the function options that should be treated as
sensitive. If users pass raw values for these keys, they will be rejected.
#### Returns
`string`[]
#### Inherited from
[`EmbeddingFunction`](EmbeddingFunction.md).[`getSensitiveKeys`](EmbeddingFunction.md#getsensitivekeys)
***
### init()?
```ts
optional init(): Promise<void>
```
Optionally load any resources needed for the embedding function.
This method is called after the embedding function has been initialized
but before any embeddings are computed. It is useful for loading local models
or other resources that are needed for the embedding function to work.
#### Returns
`Promise`&lt;`void`&gt;
@@ -148,6 +173,28 @@ The number of dimensions of the embeddings
***
### resolveVariables()
```ts
protected resolveVariables(config): Partial<M>
```
Apply variables to the config.
#### Parameters
* **config**: `Partial`&lt;`M`&gt;
#### Returns
`Partial`&lt;`M`&gt;
#### Inherited from
[`EmbeddingFunction`](EmbeddingFunction.md).[`resolveVariables`](EmbeddingFunction.md#resolvevariables)
***
### sourceField()
```ts
@@ -173,37 +220,15 @@ sourceField is used in combination with `LanceSchema` to provide a declarative d
### toJSON()
```ts
abstract toJSON(): Partial<M>
toJSON(): Record<string, any>
```
Convert the embedding function to a JSON object
It is used to serialize the embedding function to the schema
It's important that any object returned by this method contains all the necessary
information to recreate the embedding function
It should return the same object that was passed to the constructor
If it does not, the embedding function will not be able to be recreated, or could be recreated incorrectly
Get the original arguments to the constructor, to serialize them so they
can be used to recreate the embedding function later.
#### Returns
`Partial`&lt;`M`&gt;
#### Example
```ts
class MyEmbeddingFunction extends EmbeddingFunction {
constructor(options: {model: string, timeout: number}) {
super();
this.model = options.model;
this.timeout = options.timeout;
}
toJSON() {
return {
model: this.model,
timeout: this.timeout,
};
}
```
`Record`&lt;`string`, `any`&gt;
#### Inherited from