Sql filter document (#228)

This commit is contained in:
Lei Xu
2023-06-26 12:22:22 -07:00
committed by GitHub
parent 0e4c52b8a6
commit 8c5507075c
4 changed files with 102 additions and 3 deletions

6
Cargo.lock generated

@@ -3364,13 +3364,13 @@ checksum = "accd4ea62f7bb7a82fe23066fb0957d48ef677f6eeb8215f372f52e48bb32426"
[[package]]
name = "vectordb"
-version = "0.1.8"
+version = "0.1.9"
dependencies = [
"arrow-array",
"arrow-data",
"arrow-schema",
"lance",
-"object_store 0.5.6",
+"object_store 0.6.1",
"rand",
"snafu",
"tempfile",
@@ -3379,7 +3379,7 @@ dependencies = [
[[package]]
name = "vectordb-node"
-version = "0.1.8"
+version = "0.1.9"
dependencies = [
"arrow-array",
"arrow-ipc",


@@ -38,6 +38,7 @@ plugins:
markdown_extensions:
- admonition
- footnotes
- pymdownx.superfences
- pymdownx.details
- pymdownx.highlight:
@@ -66,6 +67,7 @@ nav:
- YouTube Transcript Search: examples/youtube_transcript_bot_with_nodejs.md
- References:
- Vector Search: search.md
- SQL filters: sql.md
- Indexing: ann_indexes.md
- API references:
- Python API: python/python.md


@@ -1,4 +1,5 @@
mkdocs==1.4.2
mkdocs-jupyter==0.24.1
mkdocs-material==9.1.3
mkdocs-footnotes
mkdocstrings[python]==0.20.0

96
docs/src/sql.md Normal file

@@ -0,0 +1,96 @@
# SQL filters
LanceDB supports standard SQL expressions as filter predicates. Filters can be
applied during hybrid vector search and delete operations.
Currently, Lance supports a growing list of expressions:
* ``>``, ``>=``, ``<``, ``<=``, ``=``
* ``AND``, ``OR``, ``NOT``
* ``IS NULL``, ``IS NOT NULL``
* ``IS TRUE``, ``IS NOT TRUE``, ``IS FALSE``, ``IS NOT FALSE``
* ``IN``
* ``LIKE``, ``NOT LIKE``
* ``CAST``
* ``regexp_match(column, pattern)``
For example, the following filter string is acceptable:
=== "Python"
```python
# Parenthesize the chained call so it can span multiple lines.
(
    tbl.search([100, 102])
    .where("""(
        (label IN (10, 20))
        AND
        (note.email IS NOT NULL)
    ) OR NOT note.created
    """)
)
```
=== "JavaScript"
```javascript
tbl.search([100, 102])
  .where(`(
    (label IN (10, 20))
    AND
    (note.email IS NOT NULL)
  ) OR NOT note.created
  `)
```
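When a filter is assembled from application values rather than written by hand, small string helpers keep the parentheses balanced. The following is a minimal sketch, not part of the LanceDB API: `in_list`, `all_of`, and `any_of` are hypothetical helper names, and the `IN` list is rendered with parentheses as in standard SQL.

```python
def in_list(column, values):
    """Render `column IN (v1, v2, ...)` for a list of integers."""
    if not values:
        raise ValueError("IN requires at least one value")
    # This sketch only handles plain integers; bool is excluded on purpose.
    if not all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        raise TypeError("this sketch only handles integer values")
    return f"{column} IN ({', '.join(str(v) for v in values)})"


def all_of(*predicates):
    """Join predicates with AND, wrapping each one in parentheses."""
    return " AND ".join(f"({p})" for p in predicates)


def any_of(*predicates):
    """Join predicates with OR, wrapping each one in parentheses."""
    return " OR ".join(f"({p})" for p in predicates)


# Rebuild the example filter from its parts.
filter_expr = any_of(
    all_of(in_list("label", [10, 20]), "note.email IS NOT NULL"),
    "NOT note.created",
)
print(filter_expr)
```

The resulting string can be passed to `where(...)` in the same way as a hand-written filter.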
If a column name contains special characters or is a [SQL Keyword](https://docs.rs/sqlparser/latest/sqlparser/keywords/index.html),
escape it with backticks (`` ` ``). For nested fields, each segment of the
path must be wrapped in backticks separately.
=== "SQL"
```sql
`CUBE` = 10 AND `column name with space` IS NOT NULL
AND `nested with space`.`inner with space` < 2
```
!!! warning
Field names containing periods (``.``) are not supported.
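The escaping rule can be applied mechanically when column names come from a schema. Below is a small sketch, assuming a hypothetical helper `quote_field` (not a LanceDB function) that wraps each path segment in backticks and rejects names the filter syntax cannot express. Rejecting backticks inside a name is an extra precaution of this sketch, not a documented rule.

```python
def quote_field(*segments):
    """Backtick-escape a (possibly nested) field path, one segment at a time."""
    if not segments:
        raise ValueError("at least one path segment is required")
    quoted = []
    for segment in segments:
        # Field names containing periods are not supported in filters.
        if "." in segment:
            raise ValueError(f"field name contains a period: {segment!r}")
        # Assumption of this sketch: refuse names that contain a backtick.
        if "`" in segment:
            raise ValueError(f"field name contains a backtick: {segment!r}")
        quoted.append(f"`{segment}`")
    return ".".join(quoted)


print(quote_field("CUBE") + " = 10")
print(quote_field("nested with space", "inner with space") + " < 2")
```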
To write a date, timestamp, or decimal literal, put the value as a quoted string
after the type name. For example:
=== "SQL"
```sql
date_col = date '2021-01-01'
and timestamp_col = timestamp '2021-01-01 00:00:00'
and decimal_col = decimal(8,3) '1.000'
```
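Such literals can be generated from Python values with the standard library. This is a minimal sketch; the helper names `date_literal`, `timestamp_literal`, and `decimal_literal` are hypothetical, and the decimal helper rounds the value to the requested scale.

```python
from datetime import date, datetime
from decimal import Decimal


def date_literal(value: date) -> str:
    """Render a date as `date 'YYYY-MM-DD'`."""
    return f"date '{value.isoformat()}'"


def timestamp_literal(value: datetime) -> str:
    """Render a datetime as `timestamp 'YYYY-MM-DD HH:MM:SS'` (whole seconds)."""
    text = value.strftime("%Y-%m-%d %H:%M:%S")
    return f"timestamp '{text}'"


def decimal_literal(value: Decimal, precision: int, scale: int) -> str:
    """Render a Decimal as `decimal(precision,scale) 'digits'`."""
    # Fixed-point formatting pads or rounds to exactly `scale` fractional digits.
    text = f"{value:.{scale}f}"
    return f"decimal({precision},{scale}) '{text}'"


predicate = " and ".join([
    "date_col = " + date_literal(date(2021, 1, 1)),
    "timestamp_col = " + timestamp_literal(datetime(2021, 1, 1)),
    "decimal_col = " + decimal_literal(Decimal("1"), 8, 3),
])
print(predicate)
```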
For timestamp columns, the precision can be specified as a number in the type
parameter. Microsecond precision (6) is the default.
| SQL | Time unit |
|------------------|--------------|
| ``timestamp(0)`` | Seconds |
| ``timestamp(3)`` | Milliseconds |
| ``timestamp(6)`` | Microseconds |
| ``timestamp(9)`` | Nanoseconds |
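When a column's Arrow time unit is known, the matching SQL type name can be looked up from the table above. The sketch below assumes the unit abbreviations `s`, `ms`, `us`, and `ns` used by pyarrow; `timestamp_type_for` and `timestamp_predicate` are hypothetical helper names, and the literal form `timestamp(N) '...'` is inferred from the literal syntax described earlier.

```python
# Arrow time unit abbreviation -> SQL timestamp precision, per the table above.
UNIT_TO_PRECISION = {"s": 0, "ms": 3, "us": 6, "ns": 9}


def timestamp_type_for(unit="us"):
    """Return the SQL type name for a time unit; microseconds is the default."""
    try:
        precision = UNIT_TO_PRECISION[unit]
    except KeyError:
        raise ValueError(f"unsupported time unit: {unit!r}") from None
    return f"timestamp({precision})"


def timestamp_predicate(column, op, text, unit="us"):
    """Compare a column against a timestamp literal of the given unit."""
    return f"{column} {op} {timestamp_type_for(unit)} '{text}'"


print(timestamp_predicate("timestamp_col", ">=", "2021-01-01 00:00:00", unit="s"))
```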
LanceDB internally stores data in [Apache Arrow](https://arrow.apache.org/) format.
The mapping from SQL types to Arrow types is:
| SQL type | Arrow type |
|----------|------------|
| ``boolean`` | ``Boolean`` |
| ``tinyint`` / ``tinyint unsigned`` | ``Int8`` / ``UInt8`` |
| ``smallint`` / ``smallint unsigned`` | ``Int16`` / ``UInt16`` |
| ``int`` or ``integer`` / ``int unsigned`` or ``integer unsigned`` | ``Int32`` / ``UInt32`` |
| ``bigint`` / ``bigint unsigned`` | ``Int64`` / ``UInt64`` |
| ``float`` | ``Float32`` |
| ``double`` | ``Float64`` |
| ``decimal(precision, scale)`` | ``Decimal128`` |
| ``date`` | ``Date32`` |
| ``timestamp`` | ``Timestamp`` [^1] |
| ``string`` | ``Utf8`` |
| ``binary`` | ``Binary`` |
[^1]: See the timestamp precision table above.
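The type mapping is useful when writing `CAST` expressions, since it shows which Arrow type a cast will produce. The following sketch transcribes the table into a dictionary; `arrow_type_for` and `cast_expr` are hypothetical helpers, and the Arrow names are plain strings rather than pyarrow objects.

```python
# SQL type name -> Arrow type name, transcribed from the table above.
SQL_TO_ARROW = {
    "boolean": "Boolean",
    "tinyint": "Int8",
    "tinyint unsigned": "UInt8",
    "smallint": "Int16",
    "smallint unsigned": "UInt16",
    "int": "Int32",
    "integer": "Int32",
    "int unsigned": "UInt32",
    "integer unsigned": "UInt32",
    "bigint": "Int64",
    "bigint unsigned": "UInt64",
    "float": "Float32",
    "double": "Float64",
    "decimal": "Decimal128",
    "date": "Date32",
    "timestamp": "Timestamp",
    "string": "Utf8",
    "binary": "Binary",
}


def arrow_type_for(sql_type):
    """Look up the Arrow type name for a SQL type such as `decimal(8, 3)`."""
    # Drop any "(...)" parameters, then normalize case and whitespace.
    base = " ".join(sql_type.split("(")[0].lower().split())
    if base not in SQL_TO_ARROW:
        raise ValueError(f"unsupported SQL type: {sql_type!r}")
    return SQL_TO_ARROW[base]


def cast_expr(column, sql_type):
    """Build a CAST expression after checking that the target type is known."""
    arrow_type_for(sql_type)  # raises ValueError for unknown types
    return f"CAST({column} AS {sql_type})"


print(cast_expr("label", "bigint"), "->", arrow_type_for("bigint"))
```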