Commit Graph

629 Commits

Author SHA1 Message Date
Sebastian Law
eda4c587fc use requests instead of aiohttp for underlying http client (#803)
instead of starting and stopping the current thread's event loop on
every http call, just make an http call.
2024-01-12 09:45:36 +01:00
Chang She
91d64d86e0 chore(python): add docstring for limit behavior (#800)
Closes #796
2024-01-12 09:45:36 +01:00
Chang She
ff81c0d698 feat(python): add phrase query option for fts (#798)
addresses #797 

Problem: tantivy does not expose option to explicitly

Proposed solution here: 

1. Add a `.phrase_query()` option
2. Under the hood, LanceDB takes care of wrapping the input in quotes
and replacing nested double quotes with single quotes

I've also filed an upstream issue. If they support phrase queries
natively, we can get rid of our manual processing here.
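
The quoting step described in this message can be sketched in a few lines. This is a minimal illustration and not necessarily how LanceDB implements it; the helper name `to_phrase_query` is hypothetical.

```python
def to_phrase_query(text: str) -> str:
    """Wrap text in double quotes so a query parser treats it as one phrase.

    Hypothetical helper: nested double quotes are replaced with single
    quotes first, so they cannot terminate the phrase early.
    """
    return '"' + text.replace('"', "'") + '"'


print(to_phrase_query('the "quick" fox'))
```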
2024-01-12 09:45:36 +01:00
Chang She
fcfb4587bb feat(python): add count_rows with filter option (#801)
Closes #795
2024-01-12 09:45:36 +01:00
Chang She
f43c06d9ce fix(rust): not sure why clippy is suddenly unhappy (#794)
should fix the error on top of main


https://github.com/lancedb/lancedb/actions/runs/7457190471/job/20288985725
2024-01-12 09:45:36 +01:00
Chang She
ba01d274eb feat(python): support new style optional syntax (#793) 2024-01-12 09:45:36 +01:00
Chang She
615c469af2 chore(python): document phrase queries in fts (#788)
closes #769 

Add unit test and documentation on using quotes to perform a phrase
query
2024-01-12 09:45:36 +01:00
Chang She
a649b3b1e4 feat(node): support table.schema for LocalTable (#789)
Close #773 

we pass an empty table over IPC so we don't need to manually deal with
serde. Then we just return the schema attribute from the empty table.

---------

Co-authored-by: albertlockett <albert.lockett@gmail.com>
2024-01-12 09:45:36 +01:00
Lei Xu
be76242884 chore: bump lance to 0.9.5 (#790) 2024-01-12 09:45:36 +01:00
Chang She
f4994cb0ec feat(python): Set heap size to get faster fts indexing performance (#762)
By default, tantivy-py uses a 128MB heap size. We change the default to 1GB
and allow the user to customize it.

Locally, this makes `test_fts.py` run 10x faster.
2024-01-12 09:45:36 +01:00
lucasiscovici
00b0c75710 raise exception if fts index does not exist (#776)
raise exception if fts index does not exist

---------

Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
2024-01-12 09:45:36 +01:00
sudhir
47299385fa Make examples work with current version of Openai api's (#779)
These examples don't work because of changes in the OpenAI API from
version 1 onward.
2024-01-12 09:45:36 +01:00
Chris
9dea884a7f Minor Fixes to Ingest Embedding Functions Docs (#777)
Addressed minor typos and grammatical issues to improve readability

---------

Co-authored-by: Christopher Correa <chris.correa@gmail.com>
2024-01-12 09:45:36 +01:00
Vladimir Varankin
85f8cf20aa Minor corrections for docs of embedding_functions (#780)
In addition to #777, this pull request fixes more typos in the
documentation for "Ingest Embedding Functions".
2024-01-12 09:45:36 +01:00
QianZhu
5e720b2776 small bug fix for example code in SaaS JS doc (#770) 2024-01-12 09:45:36 +01:00
Chang She
30a8223944 chore(python): handle NaN input in fts ingestion (#763)
If the input text is None, Tantivy raises an error
complaining that it cannot add a NoneType. We handle this
upstream so that None values are not added to the document.
If all of the indexed fields are None, then we skip
this document.
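
A minimal sketch of the skipping rule described in this message, assuming rows arrive as plain dictionaries. The function name `indexable_fields` is hypothetical and the sketch only covers None, as the message does.

```python
from typing import Any, Dict, List, Optional


def indexable_fields(row: Dict[str, Any], fields: List[str]) -> Optional[Dict[str, Any]]:
    """Return the indexed fields of a row that are not None.

    Returns None when every indexed field is None, which signals
    that the whole row should be skipped.
    """
    present = {name: row[name] for name in fields if row.get(name) is not None}
    return present or None


print(indexable_fields({"title": "hello", "body": None}, ["title", "body"]))
```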
2024-01-12 09:45:36 +01:00
Bengsoon Chuah
5b1587d84a Add relevant imports for each step (#764)
I found it quite disjointed to read through the documentation while
having to search for the submodule that each class should be
imported from.

For example, it is cumbersome to have to navigate to another
documentation page to find out that `EmbeddingFunctionRegistry` is from
`lancedb.embeddings`
2024-01-12 09:45:36 +01:00
QianZhu
78bafb3007 SaaS JS API sdk doc (#740)
Co-authored-by: Aidan <64613310+aidangomar@users.noreply.github.com>
2024-01-12 09:45:36 +01:00
Chang She
4417f7c5a7 feat(js): support list of string input (#755)
Add support for adding lists of string input (e.g., list of categorical
labels)

Follow-up items: #757 #758
2024-01-12 09:45:36 +01:00
Lance Release
577d6ea16e Updating package-lock.json 2024-01-12 09:45:33 +01:00
Lance Release
53d2ef5e81 Bump version: 0.4.1 → 0.4.2 2024-01-12 09:45:29 +01:00
Lance Release
e48ceb2ebd [python] Bump version: 0.4.2 → 0.4.3 2024-01-12 09:45:29 +01:00
Lei Xu
327692ccb1 chore: bump pylance to 0.9.2 (#754) 2024-01-12 09:45:29 +01:00
Xin Hao
bc224a6a0b docs: fix link (#752) 2024-01-12 09:45:29 +01:00
Chang She
2dcb39f556 feat(python): first cut batch queries for remote api (#753)
issue separate requests under the hood and concatenate results
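
The approach can be sketched as follows, assuming a hypothetical `send_query` callable that performs one remote request. Tagging each row with a `query_index` is an illustrative choice, not something this commit states.

```python
from typing import Callable, Dict, List, Sequence

Row = Dict[str, object]


def batch_query(
    vectors: Sequence[Sequence[float]],
    send_query: Callable[[Sequence[float]], List[Row]],
) -> List[Row]:
    """Issue one request per query vector and concatenate the results."""
    results: List[Row] = []
    for index, vector in enumerate(vectors):
        for row in send_query(vector):
            # Tag each row so callers can tell which query produced it
            # (an assumption made for this sketch).
            results.append({**row, "query_index": index})
    return results
```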
2024-01-12 09:45:29 +01:00
Lance Release
6bda6f2f2a [python] Bump version: 0.4.1 → 0.4.2 2024-01-12 09:45:29 +01:00
Chang She
a3fafd6b54 chore(python): update embedding API to use openai 1.6.1 (#751)
The API has changed significantly; notably, `openai.Embedding.create` no
longer exists.
https://github.com/openai/openai-python/discussions/742

Update the OpenAI embedding function and put a minimum on the openai sdk
version.
2024-01-12 09:45:29 +01:00
Chang She
dc8d6835c0 feat: add timezone handling for datetime in pydantic (#578)
If you add timezone information in the Field annotation for a datetime
then that will now be passed to the pyarrow data type.

I'm not sure how pyarrow enforces timezones. Right now, it silently
coerces to the timezone given in the column regardless of whether the
input had a matching timezone. This is probably not the right
behavior, though we could require the user to do the validation in the
pydantic model instead of doing it at the pyarrow
conversion layer.
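
One way the model-side validation mentioned above could look is sketched below with the standard library only. The function name `require_utc_offset` is hypothetical, and comparing UTC offsets rather than timezone names is a simplifying assumption.

```python
from datetime import datetime, timedelta, timezone


def require_utc_offset(value: datetime, expected: timedelta) -> datetime:
    """Reject datetimes whose UTC offset differs from the column's offset."""
    if value.tzinfo is None:
        raise ValueError("naive datetime: no timezone information")
    if value.utcoffset() != expected:
        raise ValueError(f"expected UTC offset {expected}, got {value.utcoffset()}")
    return value


# A UTC timestamp passes validation for a UTC column.
require_utc_offset(datetime(2024, 1, 1, tzinfo=timezone.utc), timedelta(0))
```

A function like this could be called from a pydantic validator so that mismatched inputs fail loudly instead of being coerced.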
2024-01-12 09:45:29 +01:00
Chang She
f55d99cec5 feat(python): add post filtering for full text search (#739)
Closes #721 

FTS returns results as a pyarrow table. Pyarrow tables have a
`filter` method, but it does not take SQL filter strings (only pyarrow
compute expressions). Instead, we do one of two things to support
`tbl.search("keywords").where("foo=5").limit(10).to_arrow()`:

Default path: If duckdb is available then use duckdb to execute the sql
filter string on the pyarrow table.
Backup path: Otherwise, write the pyarrow table to a lance dataset and
then do `to_table(filter=<filter>)`

Neither is ideal. 
Default path has two issues:
1. requires installing an extra library (duckdb)
2. duckdb mangles some fields (like fixed size list => list)

Backup path incurs a latency penalty (~20ms on ssd) to write the
resultset to disk.

In the short term, once #676 is addressed, we can write the dataset to
"memory://" instead of disk, which makes the post filter evaluate much
faster (ETA next week).

In the longer term, we'd like to be able to evaluate the filter string
on the pyarrow Table directly, one possibility being that we use
Substrait to generate pyarrow compute expressions from sql string. Or if
there's enough progress on pyarrow, it could support Substrait
expressions directly (no ETA)
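
The choice between the default and backup paths can be sketched as a simple availability check. This shows only the selection logic, not the filtering itself; `choose_post_filter_path` is a hypothetical name, and the actual code may detect duckdb differently.

```python
import importlib.util
from typing import Callable


def _module_available(name: str) -> bool:
    """Report whether a top-level module can be imported, without importing it."""
    return importlib.util.find_spec(name) is not None


def choose_post_filter_path(
    is_available: Callable[[str], bool] = _module_available,
) -> str:
    """Return "duckdb" for the default path, or "lance" for the backup path."""
    return "duckdb" if is_available("duckdb") else "lance"


print(choose_post_filter_path())
```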

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2024-01-12 09:45:29 +01:00
Aidan
3d8b2f5531 fix: createIndex index cache size (#741) 2024-01-12 09:45:29 +01:00
Chang She
b71aa4117f feat(python): support list of list fields from pydantic schema (#747)
For object detection, each row may correspond to an image and each image
can have multiple bounding boxes of x-y coordinates. This means that a
`bbox` field is potentially "list of list of float". This adds support
in our pydantic-pyarrow conversion for nested lists.
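
The recursive idea behind nested list support can be sketched with the standard `typing` helpers. This is not the library's conversion code: `arrow_type_name` is hypothetical, and the returned strings are simplified Arrow-style names rather than real pyarrow types.

```python
from typing import List, get_args, get_origin

# Simplified scalar mapping, assumed for this sketch.
_SCALARS = {float: "double", int: "int64", str: "string", bool: "bool"}


def arrow_type_name(py_type) -> str:
    """Map a (possibly nested) List[...] annotation to an Arrow-style name."""
    if get_origin(py_type) is list:
        (item_type,) = get_args(py_type)
        # Recurse so that lists of lists of any depth are handled.
        return f"list<{arrow_type_name(item_type)}>"
    return _SCALARS[py_type]


print(arrow_type_name(List[List[float]]))  # list<list<double>>
```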
2024-01-12 09:45:29 +01:00
Lance Release
55db26f59a Updating package-lock.json 2024-01-12 09:45:29 +01:00
Lance Release
7e42f58dec [python] Bump version: 0.4.0 → 0.4.1 2024-01-12 09:45:23 +01:00
Lance Release
2790b19279 Bump version: 0.4.0 → 0.4.1 2024-01-12 09:45:23 +01:00
elliottRobinson
4ba655d05e Update default_embedding_functions.md (#744)
Modify some grammar, punctuation, and spelling errors.
2024-01-12 09:45:23 +01:00
Andrew Miracle
821cf0e434 eslint fix 2024-01-09 16:27:22 +01:00
Andrew Miracle
ee1d0b596f remove console logs 2023-12-25 21:51:02 +00:00
Andrew Miracle
38a4524893 add support for openai SDK version ^4.24.1 2023-12-25 20:29:54 +00:00
Will Jones
ee0f0611d9 docs: update node API reference (#734)
This command hasn't been run for a while...
2023-12-22 10:14:31 -08:00
Will Jones
34966312cb docs: enhance Update user guide (#735)
Closes #705
2023-12-22 10:14:21 -08:00
Bert
756188358c docs: fix JS api docs for update method (#738) 2023-12-21 13:48:00 -05:00
Weston Pace
dc5126d8d1 feat: add the ability to create scalar indices (#679)
This is a pretty direct binding to the underlying lance capability
2023-12-21 09:50:10 -08:00
Aidan
50c20af060 feat: node list tables pagination (#733) 2023-12-21 11:37:19 -05:00
Chang She
0965d7dd5a doc(javascript): minor improvement on docs for working with tables (#736)
Closes #639 
Closes #638
2023-12-20 20:05:22 -08:00
Chang She
7bbb2872de bug(python): fix path handling in windows (#724)
Use pathlib for local paths so that the correct
separator is used on Windows.
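
A small illustration of why pathlib helps here. It uses the pure path classes so it runs on any platform; the directory and table names are made up.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Joining with "/" lets pathlib pick the separator for each flavor.
windows_path = PureWindowsPath("C:/data") / "my_table.lance"
posix_path = PurePosixPath("/data") / "my_table.lance"

print(windows_path)  # C:\data\my_table.lance
print(posix_path)    # /data/my_table.lance
```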

Closes #703

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2023-12-20 15:41:36 -08:00
Will Jones
e81d2975da chore: add issue templates (#732)
This PR adds issue templates, which help with two recurring issues:

* Users forget to tell us whether they are using the Node or Python SDK
* Issues don't get appropriate tags

This doesn't force the use of the templates. Because we set
`blank_issues_enabled: true`, users can still create a custom issue.
2023-12-20 15:15:24 -08:00
Will Jones
2c7f96ba4f ci: check formatting and clippy (#730) 2023-12-20 13:37:51 -08:00
Will Jones
f9dd7a5d8a fix: prevent duplicate data in FTS index (#728)
This forces the user to replace the whole FTS directory when re-creating
the index, preventing duplicate data from being created. Previously, the
whole dataset was re-added to the existing index, duplicating existing
rows in the index.

This (in combination with lancedb/lance#1707) caused #726, since the
duplicate data emitted duplicate indices for `take()` and an upstream
issue caused those queries to fail.

This solution isn't ideal, since it makes the FTS index temporarily
unavailable while the index is built. In the future, we should have
multiple FTS index directories, which would allow atomic commits of new
indexes (as well as multiple indexes for different columns).
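
One way to guarantee that a rebuild starts from an empty index directory is sketched below. The function name `reset_index_dir` is hypothetical, and the actual change may instead raise an error unless the caller opts in to replacement.

```python
import shutil
from pathlib import Path


def reset_index_dir(index_dir: Path) -> Path:
    """Delete any existing index directory and recreate it empty.

    The index is unavailable between the delete and the rebuild,
    which is the trade-off noted in the message above.
    """
    if index_dir.exists():
        shutil.rmtree(index_dir)
    index_dir.mkdir(parents=True)
    return index_dir
```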

Fixes #498.
Fixes #726.

---------

Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
2023-12-20 13:07:07 -08:00
Will Jones
1d4943688d upgrade lance to v0.9.1 (#727)
This brings in some important bugfixes related to take and aarch64
Linux. See changes at:
https://github.com/lancedb/lance/releases/tag/v0.9.1
2023-12-20 13:06:54 -08:00
Chang She
7856a94d2c feat(python): support nested reference for fts (#723)
https://github.com/lancedb/lance/issues/1739

Support nested field reference in full text search

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2023-12-20 12:28:53 -08:00