mirror of
https://github.com/lancedb/lancedb.git
synced 2026-01-07 04:12:59 +00:00
docs: provide AWS S3 cleanup and permissions advice (#903)
Adding some more quick advice for how to setup AWS S3 with LanceDB. --------- Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>
This commit is contained in:
@@ -68,6 +68,82 @@ Alternatively, if you are using AWS SSO, you can use the `AWS_PROFILE` and `AWS_
|
||||
|
||||
You can see a full list of environment variables [here](https://docs.rs/object_store/latest/object_store/aws/struct.AmazonS3Builder.html#method.from_env).
|
||||
|
||||
!!! tip "Automatic cleanup for failed writes"
|
||||
|
||||
LanceDB uses [multi-part uploads](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html) when writing data to S3 in order to maximize write speed. LanceDB will abort these uploads when it shuts down gracefully, such as when cancelled by keyboard interrupt. However, in the rare case that LanceDB crashes, it is possible that some data will be left lingering in your account. To cleanup this data, we recommend (as AWS themselves do) that you setup a lifecycle rule to delete in-progress uploads after 7 days. See the AWS guide:
|
||||
|
||||
**[Configuring a bucket lifecycle configuration to delete incomplete multipart uploads](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpu-abort-incomplete-mpu-lifecycle-config.html)**
|
||||
|
||||
#### AWS IAM Permissions
|
||||
|
||||
If a bucket is private, then an IAM policy must be specified to allow access to it. For many development scenarios, using broad permissions such as a PowerUser account is more than sufficient for working with LanceDB. However, in many production scenarios, you may wish to have as narrow as possible permissions.
|
||||
|
||||
For **read and write access**, LanceDB will need a policy such as:
|
||||
|
||||
```json
|
||||
{
|
||||
"Version": "2012-10-17",
|
||||
"Statement": [
|
||||
{
|
||||
"Effect": "Allow",
|
||||
"Action": [
|
||||
"s3:PutObject",
|
||||
"s3:GetObject",
|
||||
"s3:DeleteObject",
|
||||
],
|
||||
"Resource": "arn:aws:s3:::<bucket>/<prefix>/*"
|
||||
},
|
||||
{
|
||||
"Effect": "Allow",
|
||||
"Action": [
|
||||
"s3:ListBucket",
|
||||
"s3:GetBucketLocation"
|
||||
],
|
||||
"Resource": "arn:aws:s3:::<bucket>",
|
||||
"Condition": {
|
||||
"StringLike": {
|
||||
"s3:prefix": [
|
||||
"<prefix>/*"
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
For **read-only access**, LanceDB will need a policy such as:
|
||||
|
||||
```json
|
||||
{
|
||||
"Version": "2012-10-17",
|
||||
"Statement": [
|
||||
{
|
||||
"Effect": "Allow",
|
||||
"Action": [
|
||||
"s3:GetObject",
|
||||
],
|
||||
"Resource": "arn:aws:s3:::<bucket>/<prefix>/*"
|
||||
},
|
||||
{
|
||||
"Effect": "Allow",
|
||||
"Action": [
|
||||
"s3:ListBucket",
|
||||
"s3:GetBucketLocation"
|
||||
],
|
||||
"Resource": "arn:aws:s3:::<bucket>",
|
||||
"Condition": {
|
||||
"StringLike": {
|
||||
"s3:prefix": [
|
||||
"<prefix>/*"
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
#### S3-compatible stores
|
||||
|
||||
LanceDB can also connect to S3-compatible stores, such as MinIO. To do so, you must specify two environment variables: `AWS_ENDPOINT` and `AWS_DEFAULT_REGION`. `AWS_ENDPOINT` should be the URL of the S3-compatible store, and `AWS_DEFAULT_REGION` should be the region to use.
|
||||
|
||||
Reference in New Issue
Block a user