Files
lancedb/docs/src/sql.md
Prashanth Rao 119b928a52 docs: Updates and refactor (#683)
This PR makes incremental changes to the documentation.

* Closes #697 
* Closes #698

## Chores
- [x] Add dark mode
- [x] Fix headers in navbar
- [x] Add `extra.css` to customize navbar styles
- [x] Customize fonts for prose/code blocks, navbar and admonitions
- [x] Inspect all admonition boxes (remove redundant dropdowns) and
improve clarity and readability
- [x] Ensure that all images in the docs have white background (not
transparent) to be viewable in dark mode
- [x] Improve code formatting in code blocks to make them consistent
with autoformatters (eslint/ruff)
- [x] Add bolder weight to h1 headers
- [x] Add diagram showing the difference between embedded (OSS) and
serverless (Cloud)
- [x] Fix [Creating an empty
table](https://lancedb.github.io/lancedb/guides/tables/#creating-empty-table)
section: right now, the subheaders are not clickable.
- [x] In critical data ingestion methods like `table.add` (among
others), the type signature often does not match the actual code
- [x] Proof-read each documentation section and rewrite as necessary to
provide more context, use cases, and explanations so it reads less like
reference documentation. This is especially important for CRUD and
search sections since those are so central to the user experience.

## Restructure/new content 
- [x] The section for [Adding
data](https://lancedb.github.io/lancedb/guides/tables/#adding-to-a-table)
only shows examples for pandas and iterables. We should include pydantic
models, arrow tables, etc.
- [x] Add conceptual tutorial for IVF-PQ index
- [x] Clearly separate vector search, FTS and filtering sections so that
these are easier to find
- [x] Add docs on refine factor to explain its importance for recall.
Closes #716
- [x] Add an FAQ page showing answers to commonly asked questions about
LanceDB. Closes #746
- [x] Add simple polars example to the integrations section. Closes #756
and closes #153
- [ ] Add basic docs for the Rust API (more detailed API docs can come
later). Closes #781
- [x] Add a section on the various storage options on local vs. cloud
(S3, EBS, EFS, local disk, etc.) and the tradeoffs involved. Closes #782
- [x] Revamp filtering docs: add pre-filtering examples and redo headers
and update content for SQL filters. Closes #783 and closes #784.
- [x] Add docs for data management: compaction, cleaning up old versions
and incremental indexing. Closes #785
- [ ] Add a benchmark section that also discusses some best practices.
Closes #787

---------

Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
2024-01-19 00:18:37 +05:30

4.6 KiB

Filtering

Pre and post-filtering

LanceDB supports filtering of query results based on metadata fields. By default, post-filtering is performed on the top-k results returned by the vector search. However, pre-filtering is also an option that performs the filter prior to vector search. This can be useful to narrow down on the search space on a very large dataset to reduce query latency.

=== "Python" py result = ( tbl.search([0.5, 0.2]) .where("id = 10", prefilter=True) .limit(1) .to_arrow() )

=== "JavaScript" javascript let result = await tbl.search(Array(1536).fill(0.5)) .limit(1) .filter("id = 10") .prefilter(true) .execute()

SQL filters

Because it's built on top of DataFusion, LanceDB embraces the utilization of standard SQL expressions as predicates for filtering operations. It can be used during vector search, update, and deletion operations.

Currently, Lance supports a growing list of SQL expressions.

  • >, >=, <, <=, =
  • AND, OR, NOT
  • IS NULL, IS NOT NULL
  • IS TRUE, IS NOT TRUE, IS FALSE, IS NOT FALSE
  • IN
  • LIKE, NOT LIKE
  • CAST
  • regexp_match(column, pattern)

For example, the following filter string is acceptable:

=== "Python"

```python
tbl.search([100, 102]) \
   .where("(item IN ('item 0', 'item 2')) AND (id > 10)") \
   .to_arrow()
```

=== "Javascript"

```javascript
await tbl.search(Array(1536).fill(0))
   .where("(item IN ('item 0', 'item 2')) AND (id > 10)")
   .execute()
```

If your column name contains special characters or is a SQL Keyword, you can use backtick (`) to escape it. For nested fields, each segment of the path must be wrapped in backticks.

=== "SQL" sql `CUBE` = 10 AND `column name with space` IS NOT NULL AND `nested with space`.`inner with space` < 2

!!! warning Field names containing periods (.) are not supported.

Literals for dates, timestamps, and decimals can be written by writing the string value after the type name. For example

=== "SQL" sql date_col = date '2021-01-01' and timestamp_col = timestamp '2021-01-01 00:00:00' and decimal_col = decimal(8,3) '1.000'

For timestamp columns, the precision can be specified as a number in the type parameter. Microsecond precision (6) is the default.

SQL Time unit
timestamp(0) Seconds
timestamp(3) Milliseconds
timestamp(6) Microseconds
timestamp(9) Nanoseconds

LanceDB internally stores data in Apache Arrow format. The mapping from SQL types to Arrow types is:

SQL type Arrow type
boolean Boolean
tinyint / tinyint unsigned Int8 / UInt8
smallint / smallint unsigned Int16 / UInt16
int or integer / int unsigned or integer unsigned Int32 / UInt32
bigint / bigint unsigned Int64 / UInt64
float Float32
double Float64
decimal(precision, scale) Decimal128
date Date32
timestamp Timestamp 1
string Utf8
binary Binary

You can also filter your data without search.

=== "Python" python tbl.search().where("id = 10").limit(10).to_arrow()

=== "JavaScript" javascript await tbl.where('id = 10').limit(10).execute()

!!! warning If your table is large, this could potentially return a very large amount of data. Please be sure to use a limit clause unless you're sure you want to return the whole result set.


  1. See precision mapping in previous table. ↩︎