feat: dynamodb commit store support (#1410)

This allows users to specify URIs like:

```
s3+ddb://my_bucket/path?ddbTableName=myCommitTable
```

and it will support concurrent writes in S3.

* [x] Add dynamodb integration tests
* [x] Add modifications to get it working in Python sync API
* [x] Added section in documentation describing how to configure.

Closes #534

---------

Co-authored-by: universalmind303 <cory.grinstead@gmail.com>
This commit is contained in:
Will Jones
2024-06-28 09:30:36 -07:00
committed by GitHub
parent d6485f1215
commit 865ed99881
13 changed files with 1844 additions and 58 deletions


@@ -265,6 +265,108 @@ For **read-only access**, LanceDB will need a policy such as:
}
```
#### DynamoDB Commit Store for concurrent writes
By default, S3 does not support concurrent writes. If two or more processes
write to the same table at the same time, the data can be corrupted. This is
because S3, unlike other object stores, does not provide an atomic
put-if-not-exists or copy-if-not-exists operation.
To enable concurrent writes, you can configure LanceDB to use a DynamoDB table
as a commit store. This table will be used to coordinate writes between
different processes. To enable this feature, you must modify your connection
URI to use the `s3+ddb` scheme and add a query parameter `ddbTableName` with the
name of the table to use.
=== "Python"
```python
import lancedb
db = await lancedb.connect_async(
"s3+ddb://bucket/path?ddbTableName=my-dynamodb-table",
)
```
=== "JavaScript"
```javascript
const lancedb = require("lancedb");
const db = await lancedb.connect(
"s3+ddb://bucket/path?ddbTableName=my-dynamodb-table",
);
```
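If the bucket, path, and table name come from configuration, the URI can be assembled rather than hard-coded. The following is a minimal sketch; `build_ddb_uri` is a hypothetical helper, not part of the LanceDB API, and it assumes only the `s3+ddb` scheme and the `ddbTableName` query parameter described above.

```python
from urllib.parse import quote, urlencode


def build_ddb_uri(bucket: str, path: str, ddb_table_name: str) -> str:
    """Hypothetical helper: assemble an s3+ddb:// connection URI."""
    if not bucket or not ddb_table_name:
        raise ValueError("bucket and ddb_table_name are required")
    # Percent-encode the query value and the path (slashes are kept as-is).
    query = urlencode({"ddbTableName": ddb_table_name})
    clean_path = quote(path.strip("/"))
    return f"s3+ddb://{bucket}/{clean_path}?{query}"


uri = build_ddb_uri("bucket", "path", "my-dynamodb-table")
print(uri)
```

The resulting string can be passed to `lancedb.connect_async` in the same way as the literal URI.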
The DynamoDB table must be created with the following schema:
- Hash key: `base_uri` (string)
- Range key: `version` (number)
You can create this programmatically with:
=== "Python"
<!-- skip-test -->
```python
import boto3

dynamodb = boto3.client("dynamodb")
table = dynamodb.create_table(
    TableName=table_name,
    KeySchema=[
        {"AttributeName": "base_uri", "KeyType": "HASH"},
        {"AttributeName": "version", "KeyType": "RANGE"},
    ],
    AttributeDefinitions=[
        {"AttributeName": "base_uri", "AttributeType": "S"},
        {"AttributeName": "version", "AttributeType": "N"},
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 1, "WriteCapacityUnits": 1},
)
```
=== "JavaScript"
<!-- skip-test -->
```javascript
import {
  CreateTableCommand,
  DynamoDBClient,
} from "@aws-sdk/client-dynamodb";

const dynamodb = new DynamoDBClient({
  region: CONFIG.awsRegion,
  credentials: {
    accessKeyId: CONFIG.awsAccessKeyId,
    secretAccessKey: CONFIG.awsSecretAccessKey,
  },
  endpoint: CONFIG.awsEndpoint,
});

const command = new CreateTableCommand({
  TableName: table_name,
  AttributeDefinitions: [
    {
      AttributeName: "base_uri",
      AttributeType: "S",
    },
    {
      AttributeName: "version",
      AttributeType: "N",
    },
  ],
  KeySchema: [
    { AttributeName: "base_uri", KeyType: "HASH" },
    { AttributeName: "version", KeyType: "RANGE" },
  ],
  ProvisionedThroughput: {
    ReadCapacityUnits: 1,
    WriteCapacityUnits: 1,
  },
});

// Send the command with the client created above.
await dynamodb.send(command);
```
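If the table already exists, it may be worth confirming that it matches the required schema before pointing LanceDB at it. The sketch below is an illustration only: `has_commit_store_schema` is a hypothetical helper, and it assumes its input is the `Table` dictionary returned by DynamoDB's `describe_table` call.

```python
# Required keys: attribute name -> (key type, attribute type).
EXPECTED_KEYS = {"base_uri": ("HASH", "S"), "version": ("RANGE", "N")}


def has_commit_store_schema(table: dict) -> bool:
    """Hypothetical check: does a table description match the commit store schema?"""
    key_types = {k["AttributeName"]: k["KeyType"] for k in table.get("KeySchema", [])}
    attr_types = {
        a["AttributeName"]: a["AttributeType"]
        for a in table.get("AttributeDefinitions", [])
    }
    # The key schema must contain exactly the two expected attributes.
    if set(key_types) != set(EXPECTED_KEYS):
        return False
    return all(
        key_types[name] == key_type and attr_types.get(name) == attr_type
        for name, (key_type, attr_type) in EXPECTED_KEYS.items()
    )


# Example description shaped like the "Table" entry of a describe_table response.
description = {
    "KeySchema": [
        {"AttributeName": "base_uri", "KeyType": "HASH"},
        {"AttributeName": "version", "KeyType": "RANGE"},
    ],
    "AttributeDefinitions": [
        {"AttributeName": "base_uri", "AttributeType": "S"},
        {"AttributeName": "version", "AttributeType": "N"},
    ],
}
print(has_commit_store_schema(description))
```

With boto3, the input would come from `boto3.client("dynamodb").describe_table(TableName=table_name)["Table"]`.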
#### S3-compatible stores
LanceDB can also connect to S3-compatible stores, such as MinIO. To do so, you must specify both region and endpoint: