2 Commits

Author SHA1 Message Date
Dan Tasse
49570ddebf Updates 2026-06-12 17:04:46 -04:00
Dan Tasse
d43e436ad8 feat: Skills to connect and update column metadata 2026-06-12 15:20:59 -04:00
2 changed files with 220 additions and 0 deletions

@@ -0,0 +1,178 @@
---
name: lancedb-column-metadata
description: Column metadata authoring for LanceDB tables via the REST API. This skill is required for tasks like writing field descriptions, setting tags on columns (field_type, model, project_id, version), classifying columns as embeddings vs labels vs eval metrics, or grouping versioned columns into logical families — because it has the API integration needed to read the schema and persist metadata back. Invoke whenever someone wants to document, annotate, tag, or classify what their table columns ARE. Trigger even without an explicit "LanceDB" mention, as long as the context is column-level documentation or tagging for an ML or vector database table.
metadata:
short-description: Write column descriptions, tags, and logical groupings to a LanceDB table
---
## Overview
This skill authors column-level metadata for a LanceDB table. It connects to a LanceDB deployment over its REST API, inspects the table schema, generates appropriate metadata, and writes it back.
## Step 0: Establish the connection
Use the `lancedb-connect` skill (invoke it via the Skill tool) to resolve the base URL and auth headers (`x-api-key`, `x-lancedb-database`) for whichever deployment the user is working against — enterprise/self-hosted or a local dev server. Skip it only if the connection details are already established in the conversation.
All examples below use `{base_url}` — substitute the resolved endpoint and include the resolved headers on every request.
## Metadata keys
All metadata uses namespaced keys:
| Key | Purpose | Example value |
|-----|---------|---------------|
| `lancedb:description` | Human-readable explanation of what the column contains | `"CLIP ViT-L/14 image embedding, L2-normalized (768-dim)"` |
| `lancedb:tag:<name>` | Flexible key-value tag; the suffix names the tag category | `lancedb:tag:field_type: "embedding"`, `lancedb:tag:model: "clip"`, `lancedb:tag:project_id: "foo"` |
| `lancedb:logical-column` | Logical group/family this column belongs to | `"clip_features"` |
Tags are open-ended — use whatever key suffix and value make sense given the user's intent. The tag suffix should describe *what is being classified* (e.g., `field_type`, `model`, `project_id`) and the value describes *how*.
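As a minimal sketch of how these three key shapes combine for one column, the helper below assembles a metadata dict. The function name `build_column_metadata` and its parameters are hypothetical, introduced only for illustration; only the key names come from the table above.

```python
from __future__ import annotations

from typing import Optional


def build_column_metadata(
    description: Optional[str] = None,
    tags: Optional[dict[str, str]] = None,
    logical_column: Optional[str] = None,
) -> dict[str, str]:
    """Assemble namespaced metadata keys for one column (hypothetical helper)."""
    metadata: dict[str, str] = {}
    if description is not None:
        metadata["lancedb:description"] = description
    for name, value in (tags or {}).items():
        # Each tag becomes its own key: lancedb:tag:<name>
        metadata[f"lancedb:tag:{name}"] = value
    if logical_column is not None:
        metadata["lancedb:logical-column"] = logical_column
    return metadata
```

Calling it with `tags={"field_type": "embedding", "model": "clip"}` yields two separate `lancedb:tag:*` keys, which is why multiple tags on one column never collide.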
## Step 1: Resolve the table identifier
You need:
- **Table name** (required) — e.g., `my_table` or `my_namespace.my_table`
- **Database name** — ask if not provided and not inferable from context; it goes in the `x-lancedb-database` header, never in the URL path
The table identifier in the URL path is typically `table_name` for a top-level table, or `namespace$table_name` if the table lives in a namespace. The API accepts a `delimiter` query parameter to parse compound identifiers (default `$`).
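The identifier rule above could be sketched as follows. This assumes that dots in the user-facing name separate namespace levels and that `$` can be sent unescaped in the path; both are assumptions to confirm against the deployment. `to_table_id` and `describe_url` are hypothetical helper names.

```python
from urllib.parse import quote, urlencode


def to_table_id(table_name: str, delimiter: str = "$") -> str:
    """Turn 'my_namespace.my_table' into 'my_namespace$my_table'."""
    # Assumption: dots in the user-facing name separate namespace levels.
    return delimiter.join(part for part in table_name.split(".") if part)


def describe_url(base_url: str, table_name: str, delimiter: str = "$") -> str:
    """Build the describe endpoint URL for a table (hypothetical helper)."""
    table_id = to_table_id(table_name, delimiter)
    url = f"{base_url.rstrip('/')}/v1/table/{quote(table_id, safe='$')}/describe"
    if delimiter != "$":
        # Only send the query parameter when overriding the default delimiter.
        url += "?" + urlencode({"delimiter": delimiter})
    return url
```

The database name is deliberately absent here: it travels in the `x-lancedb-database` header, not in the path.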
## Step 2: Describe the table
```http
POST {base_url}/v1/table/{table_id}/describe
Content-Type: application/json

{}
```
The response contains `schema.fields` — an array of field objects:
```json
{
  "schema": {
    "fields": [
      {
        "name": "clip_embedding_v3",
        "type": { "type": "FixedSizeList", "fields": [...], "listSize": 768 },
        "nullable": true,
        "metadata": { "lancedb:description": "..." }
      }
    ]
  }
}
```
Each field has:
- `name` — field name
- `type` — Arrow data type (check `type.type` for the type string)
- `nullable` — boolean
- `metadata` — existing key-value metadata (read this before writing to avoid redundant updates)
For struct/nested fields, recurse into `type.fields` and represent them as dot-notation paths (e.g., `parent.child`).
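A rough sketch of that recursion is below. It assumes struct fields report a `type.type` string equal to `"Struct"` (compared case-insensitively); the exact spelling is an assumption, so check it against a real describe response. `iter_field_paths` is a hypothetical name.

```python
from __future__ import annotations

from typing import Iterator


def iter_field_paths(fields: list[dict], prefix: str = "") -> Iterator[tuple[str, dict]]:
    """Yield (dot_path, field) for every field, descending into structs."""
    for field in fields:
        path = f"{prefix}{field['name']}"
        yield path, field
        type_info = field.get("type") or {}
        # Assumption: only struct children are addressable by dot path, so
        # list item fields (e.g. inside FixedSizeList) are not expanded.
        if str(type_info.get("type", "")).lower() == "struct":
            yield from iter_field_paths(type_info.get("fields") or [], prefix=f"{path}.")
```

Each yielded field dict still carries its `metadata`, so the same pass can collect existing values before any write.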
If the user hasn't specified which columns to update, work with all columns.
## Step 3: Generate metadata
Decide what to generate based on the user's request.
### Writing descriptions (`lancedb:description`)
Base descriptions on:
- The column name and Arrow type (e.g., `FixedSizeList` of floats → likely an embedding)
- User-supplied context (upstream pipeline, sample values, domain knowledge)
- Name patterns: `_embedding`/`_vec`/`_embed` → vector; `_label`/`_class` → label; `_score`/`_eval`/`_metric` → evaluation metric
Be specific and concise. Good: `"Sentence-BERT embedding of the query text (768-dim)."` Not: `"An embedding column."`
### Tagging columns (`lancedb:tag:<name>`)
Choose tag key names that match what the user asked to annotate. Common patterns:
- Semantic field type → `lancedb:tag:field_type: "embedding"` / `"text"` / `"image"` / `"label"` / `"eval"` / `"id"` / `"metadata"`
- Model or source → `lancedb:tag:model: "clip"` / `"bert"` / `"vit"`
- Project affiliation → `lancedb:tag:project_id: "<name>"`
- Version → `lancedb:tag:version: "v3"` (and `lancedb:tag:latest: "true"` for the newest)
Use Arrow type as a hint: `FixedSizeList` + float → embedding; `Utf8`/`LargeUtf8` → text; `Binary` → image or blob.
Multiple tags on the same column are fine — each is a separate key.
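The name patterns and Arrow type hints can be combined into a first-pass guess for `lancedb:tag:field_type`, roughly as sketched here. `guess_field_type` is a hypothetical helper; as a simplification it treats any `FixedSizeList` as an embedding without checking the element type, and it leaves `Binary` unclassified because the type alone cannot tell an image from a generic blob.

```python
from typing import Optional

# Name fragments taken from the patterns listed above.
_NAME_HINTS = (
    (("_embedding", "_vec", "_embed"), "embedding"),
    (("_label", "_class"), "label"),
    (("_score", "_eval", "_metric"), "eval"),
)


def guess_field_type(name: str, arrow_type: str) -> Optional[str]:
    """Return a candidate field_type tag value, or None when unsure."""
    lowered = name.lower()
    # Names win over types: a Utf8 column called sentiment_label is a label.
    for fragments, field_type in _NAME_HINTS:
        if any(fragment in lowered for fragment in fragments):
            return field_type
    if arrow_type == "FixedSizeList":
        return "embedding"
    if arrow_type in ("Utf8", "LargeUtf8"):
        return "text"
    return None  # ambiguous (e.g. Binary): ask the user or inspect samples
```

A `None` result is a cue to fall back on user-supplied context rather than to write a tag anyway.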
### Grouping into logical columns (`lancedb:logical-column`)
Look for naming patterns across columns:
- `clip_v1`, `clip_v2`, `clip_v3` → logical column `"clip"`, latest is `v3`
- `text_embed_20240101`, `text_embed_20240601` → logical column `"text_embed"`, latest is the most recent date suffix
Write `lancedb:logical-column` on all members of a group. Mark the newest with `lancedb:tag:latest: "true"` (in addition to its version tag).
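One way to detect such families is a suffix regex, sketched below under two assumptions: version suffixes look like `_v<digits>` or an 8-digit date, and a family needs at least two members. For date suffixes the sketch reuses the date string as the version tag, which is a choice of this example rather than something the skill prescribes. `group_versioned_columns` is a hypothetical name.

```python
import re
from collections import defaultdict

# Matches names like clip_v3 or text_embed_20240601.
_VERSION_SUFFIX = re.compile(r"^(?P<base>.+)_(?P<version>v\d+|\d{8})$")


def group_versioned_columns(names):
    """Map column name -> grouping metadata for columns in a versioned family."""
    groups = defaultdict(list)
    for name in names:
        match = _VERSION_SUFFIX.match(name)
        if match:
            groups[match["base"]].append((match["version"], name))

    updates = {}
    for base, members in groups.items():
        if len(members) < 2:
            continue  # assumption: a lone versioned column is not a family
        # Compare numerically so that v10 sorts after v2.
        latest = max(members, key=lambda m: int(m[0].lstrip("v")))[1]
        for version, name in members:
            metadata = {
                "lancedb:logical-column": base,
                "lancedb:tag:version": version,
            }
            if name == latest:
                metadata["lancedb:tag:latest"] = "true"
            updates[name] = metadata
    return updates
```

Numeric comparison matters here: a plain string sort would rank `v2` above `v10`.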
## Step 4: Write the metadata
```http
POST {base_url}/v1/table/{table_id}/update_field_metadata
Content-Type: application/json

{
  "updates": [
    {
      "path": "clip_v3",
      "metadata": {
        "lancedb:description": "CLIP ViT-L/14 image embedding, L2-normalized (768-dim).",
        "lancedb:tag:field_type": "embedding",
        "lancedb:tag:model": "clip",
        "lancedb:tag:version": "v3",
        "lancedb:tag:latest": "true",
        "lancedb:logical-column": "clip"
      },
      "replace": false
    },
    {
      "path": "clip_v2",
      "metadata": {
        "lancedb:description": "CLIP ViT-B/32 image embedding (512-dim), superseded by v3.",
        "lancedb:tag:field_type": "embedding",
        "lancedb:tag:model": "clip",
        "lancedb:tag:version": "v2",
        "lancedb:logical-column": "clip"
      },
      "replace": false
    }
  ]
}
```
Rules:
- **Use `"replace": false`** (merge) by default — this preserves existing metadata the user didn't ask to change
- Use `"replace": true` only if the user explicitly asks to overwrite all existing metadata on a column
- Set a value to `null` to delete a specific key
- Batch all updates in a single request when possible
The response includes `version` (new table version) and `fields` (the updated metadata per field).
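For callers scripting this step, the request could be assembled with the standard library as sketched here. `build_update_request` is a hypothetical helper; the payload shape mirrors the example above, and the request is only constructed, not sent.

```python
import json
import urllib.request


def build_update_request(base_url, table_id, headers, updates_by_path, replace=False):
    """Build one batched update_field_metadata request (hypothetical helper)."""
    payload = {
        "updates": [
            # replace=False merges with existing metadata instead of overwriting it.
            {"path": path, "metadata": metadata, "replace": replace}
            for path, metadata in updates_by_path.items()
        ]
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/table/{table_id}/update_field_metadata",
        data=json.dumps(payload).encode("utf-8"),
        headers={**headers, "Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it; a Python `None` metadata value serializes to JSON `null`, which is how a key deletion would be expressed.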
## Step 5: Confirm
Report back:
- Which columns were updated and what was written
- The new table version number
- Any columns skipped (e.g., already had up-to-date metadata)
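To know which columns count as skipped, the proposed metadata can be compared with what the describe response already returned. The sketch below assumes merge semantics (`"replace": false`), so only keys whose values differ need to be sent; `split_updates` is a hypothetical name.

```python
def split_updates(existing, proposed):
    """Separate proposed metadata into changes to write and paths to skip.

    existing: {path: metadata dict from the describe response}
    proposed: {path: metadata dict generated for this request}
    """
    to_write = {}
    skipped = []
    for path, metadata in proposed.items():
        current = existing.get(path) or {}
        # Keep only keys whose value differs. A None (delete) for a key that
        # is already absent also counts as unchanged.
        changed = {key: value for key, value in metadata.items() if current.get(key) != value}
        if changed:
            to_write[path] = changed
        else:
            skipped.append(path)
    return to_write, skipped
```

The `skipped` list feeds the report directly, and `to_write` is what goes into the batched update request.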
---
## Quick examples
**"Write descriptions for all columns in the `product_embeddings` table"**
1. POST `/v1/table/product_embeddings/describe` → get all fields
2. Generate a `lancedb:description` for each column based on name + type
3. POST `update_field_metadata` with descriptions
4. Report
**"Tag the columns in `model_outputs` with their field type and model"**
1. Describe `model_outputs`
2. For each field, classify by name + Arrow type → set `lancedb:tag:field_type` and `lancedb:tag:model` where applicable
3. POST `update_field_metadata`
4. Report
**"Group the feature columns in `training_features` into logical families and mark the latest version"**
1. Describe the table
2. Find version patterns → assign `lancedb:logical-column` and `lancedb:tag:version`; mark newest with `lancedb:tag:latest: "true"`
3. POST `update_field_metadata`
4. Show the grouping

@@ -0,0 +1,42 @@
---
name: lancedb-connect
description: Resolve how to connect to a LanceDB deployment over the REST API — figure out the base URL, API key, and database header. Use this before making any REST requests to a LanceDB table, whenever the endpoint or auth setup is not already known. Also useful on its own when someone asks how to connect, authenticate, or curl their LanceDB instance.
metadata:
short-description: Resolve the base URL and auth headers for a LanceDB deployment
---
## Goal
Produce two things every REST request needs:
1. **Base URL**: the endpoint
2. **Headers**: `x-api-key`, and usually `x-lancedb-database`
## Resolution steps
1. If the user already gave a URL and API key (or said which environment they're working against), use that.
2. Otherwise, look for credentials already available in the environment:
   - Env vars like `LANCEDB_URI` / `LANCEDB_HOST` / `LANCEDB_API_KEY`
   - A LanceDB endpoint already running or port-forwarded locally (the REST default port is 2333, i.e. `http://localhost:2333`)
3. If you didn't find both pieces, ask the user directly: **"What's your LanceDB endpoint's URL, and what's your API key?"** Also ask which database to use if it isn't obvious. Don't guess or probe further — the user knows their deployment.
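The environment lookup in step 2 might look like the sketch below. It assumes `LANCEDB_URI` or `LANCEDB_HOST` holds the REST endpoint URL, which may not be true of every setup, and it takes the database name as an explicit argument rather than guessing an env var for it. `resolve_connection` is a hypothetical name.

```python
import os


def resolve_connection(env=None, database=None):
    """Return base URL and headers from the environment, or None if incomplete."""
    env = os.environ if env is None else env
    # Assumption: either variable, when set, is the REST endpoint URL.
    base_url = env.get("LANCEDB_URI") or env.get("LANCEDB_HOST")
    api_key = env.get("LANCEDB_API_KEY")
    if not (base_url and api_key):
        return None  # both pieces are required; ask the user instead of guessing
    headers = {"x-api-key": api_key}
    if database:
        headers["x-lancedb-database"] = database
    return {"base_url": base_url.rstrip("/"), "headers": headers}
```

A `None` result corresponds to step 3: stop looking and ask the user directly.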
## Validating the connection
Make a cheap authenticated request and check the status:
```bash
curl -s -w "\n%{http_code}" "{base_url}/v1/table/?limit=1" \
-H "x-api-key: <key>" \
-H "x-lancedb-database: <database>"
```
- `200` — connection, key, and database header all good
- `401` — API key missing or wrong
- `400` mentioning a database header — this deployment expects `x-lancedb-database`
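The status checks above reduce to a small lookup, sketched here. The returned labels and the name `diagnose_status` are hypothetical, and the substring match on "database" is only a heuristic for the 400 case since the exact error text is not specified.

```python
def diagnose_status(status: int, body: str = "") -> str:
    """Map the validation response to a short diagnosis label."""
    if status == 200:
        return "ok"
    if status == 401:
        return "bad_api_key"
    # Heuristic: the error text is expected to mention the database header.
    if status == 400 and "database" in body.lower():
        return "missing_database_header"
    return "unknown"
```

Anything that falls through to `"unknown"` is worth showing to the user verbatim rather than retrying with guessed settings.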
## Non-REST equivalents
If the caller would rather use the SDK or CLI than raw REST, the same credentials work:
- Python SDK: `lancedb.connect("db://<database>", api_key="<key>", host_override="<base_url>")`
- `lancedb` CLI: a `[profiles.<name>]` entry in `~/.lancedb/config.toml` with `http_server_url`, `api_key`, `database`