# Weaviate

## Overview
This page guides you through the process of setting up the Weaviate destination connector.
There are three parts to this:

- Processing - split up individual records into chunks so they will fit the context window, and decide which fields to use as context and which are supplementary metadata.
- Embedding - convert the text into a vector representation using a pre-trained model (currently, OpenAI's `text-embedding-ada-002` and Cohere's `embed-english-light-v2.0` are supported).
- Indexing - store the vectors in a vector database for similarity search.
## Prerequisites
To use the Weaviate destination, you'll need:
- Access to a running Weaviate instance (either self-hosted or via Weaviate Cloud Services), minimum version 1.21.2
- Either:
  - An account with API access for OpenAI or Cohere (depending on which embedding method you want to use)
  - Pre-calculated embeddings stored in a field in your source database
You'll need the following information to configure the destination:
- Embedding service API Key - The API key for your OpenAI or Cohere account
- Weaviate cluster URL - The URL of the Weaviate cluster to load data into. Airbyte Cloud only supports connecting to your Weaviate instance with TLS encryption.
- Weaviate credentials - The credentials for your Weaviate instance (either API token or username/password)
## Features

| Feature                        | Supported? (Yes/No) | Notes |
| :----------------------------- | :------------------ | :---- |
| Full Refresh Sync              | Yes                 |       |
| Incremental - Append Sync      | Yes                 |       |
| Incremental - Append + Deduped | Yes                 | Deleting records via CDC is not supported (see issue #29827) |
| Namespaces                     | No                  |       |
| Provide vector                 | Yes                 | Either taken from a field or calculated during the load process |
## Data type mapping

All fields specified as metadata fields will be stored as properties on the object and can be used for filtering. The following data types are allowed for metadata fields:
- String
- Number (integer or floating point, gets converted to a 64-bit floating point)
- Booleans (true, false)
- List of String
All other fields are serialized into their JSON representation.
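For illustration, here is a hypothetical record and how its metadata fields would map (the field names are invented for this example):

```python
import json

# Hypothetical source record; field names are purely illustrative.
record = {
    "title": "Acme Widget",           # String: stored as-is
    "price": 9.99,                    # Number: converted to a 64-bit float
    "in_stock": True,                 # Boolean: stored as-is
    "tags": ["sale", "new"],          # List of String: stored as-is
    "dimensions": {"w": 10, "h": 5},  # Any other type: serialized to JSON
}

# The unsupported type ends up as its JSON representation:
assert json.dumps(record["dimensions"]) == '{"w": 10, "h": 5}'
```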
## Configuration

### Processing
Each record will be split into text fields and metadata fields as configured in the "Processing" section. All text fields are concatenated into a single string and then split into chunks of the configured length. If specified, the metadata fields are stored as-is along with the embedded text chunks. Note that metadata fields can only be used for filtering, not for retrieval, and have to be of type string, number, or boolean (all other values are ignored). Also note that there's a 40kb limit on the total size of the metadata saved for each entry.
When specifying text fields, you can access nested fields in the record by using dot notation, e.g. `user.name` will access the `name` field in the `user` object. It's also possible to use wildcards to access all fields in an object, e.g. `users.*.name` will access all `name` fields in all entries of the `users` array.
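To illustrate, given this hypothetical record:

```python
record = {
    "user": {"name": "Ada"},
    "users": [{"name": "Grace"}, {"name": "Edsger"}],
}

# "user.name"    selects "Ada"
# "users.*.name" selects "Grace" and "Edsger"
```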
The chunk length is measured in tokens produced by the `tiktoken` library. The maximum is 8191 tokens, which is the maximum length supported by the `text-embedding-ada-002` model.
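As a sketch, you can use `tiktoken` directly to check how many tokens a piece of text produces (assuming the package is installed):

```python
import tiktoken

# text-embedding-ada-002 uses the cl100k_base encoding.
encoding = tiktoken.encoding_for_model("text-embedding-ada-002")

text = "Weaviate is a vector database."
num_tokens = len(encoding.encode(text))
assert num_tokens <= 8191, "chunk exceeds the model's maximum length"
```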
The stream name gets added as a metadata field `_ab_stream` to each document. If available, the primary key of the record is used to identify the document to avoid duplications when updated versions of records are indexed. It is added as the `_ab_record_id` metadata field.
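An indexed chunk might therefore look roughly like this (a hypothetical sketch; the stream name and the exact ID format are illustrative):

```python
indexed_object = {
    "text": "Acme Widget is a ...",  # the embedded text chunk
    "_ab_stream": "products",        # the stream the record came from
    "_ab_record_id": "products_42",  # derived from the primary key, used to avoid duplicates
    # ...plus any configured metadata fields...
}
```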
### Embedding
The connector can use one of the following embedding methods:
- OpenAI - using the OpenAI API, the connector will produce embeddings using the `text-embedding-ada-002` model with 1536 dimensions. This integration will be constrained by the speed of the OpenAI embedding API.
- Cohere - using the Cohere API, the connector will produce embeddings using the `embed-english-light-v2.0` model with 1024 dimensions.
- From field - if you have pre-calculated embeddings stored in a field in your source database, you can use the `From field` integration to load them into Weaviate. The field must be a JSON array of numbers, e.g. `[0.1, 0.2, 0.3]` (see the sketch after this list).
- No embedding - if you don't want to use embeddings or have configured a vectorizer for your class, you can use the `No embedding` integration.
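If you use the `From field` integration, the field's value must parse as a JSON array of numbers. A minimal sketch of such a check (the helper is hypothetical, not part of the connector; the expected dimensionality depends on your model):

```python
import json

def is_valid_embedding(value: str, expected_dimensions: int) -> bool:
    """Return True if value is a JSON array of numbers with the expected length."""
    try:
        parsed = json.loads(value)
    except (TypeError, ValueError):
        return False
    return (
        isinstance(parsed, list)
        and len(parsed) == expected_dimensions
        and all(isinstance(x, (int, float)) and not isinstance(x, bool) for x in parsed)
    )

assert is_valid_embedding("[0.1, 0.2, 0.3]", expected_dimensions=3)
```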
For testing purposes, it's also possible to use the Fake embeddings integration. It will generate random embeddings and is suitable to test a data pipeline without incurring embedding costs.
### Indexing

All streams will be indexed into separate classes derived from the stream name. If a class doesn't exist in the schema of the cluster, it will be created using the configured vectorizer. In this case, dynamic schema has to be enabled on the server.
You can also create the class in Weaviate in advance if you need more control over the schema in Weaviate. In this case, the text properties `_ab_stream` and `_ab_record_id` need to be created for bookkeeping reasons. In case a sync is run in `Overwrite` mode, the class will be deleted and recreated.
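For example, a class could be created in advance with the v3 `weaviate-client` Python package like this (a sketch; the cluster URL, class name and vectorizer choice are assumptions to adapt to your setup):

```python
import weaviate

client = weaviate.Client("https://your-cluster.weaviate.network")  # assumed URL

client.schema.create_class(
    {
        "class": "Products",   # Weaviate class derived from the stream name
        "vectorizer": "none",  # "none" if the connector provides the vectors
        "properties": [
            # Bookkeeping properties required by the connector:
            {"name": "_ab_stream", "dataType": ["text"]},
            {"name": "_ab_record_id", "dataType": ["text"]},
            # ...your own properties...
        ],
    }
)
```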
As properties have to start with a lowercase letter in Weaviate, field names might be updated during the loading process. The field names `id`, `_id` and `_additional` are reserved keywords in Weaviate, so they will be renamed to `raw_id`, `raw__id` and `raw_additional` respectively.
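The renaming can be sketched like this (an approximation of the documented behavior, not the connector's actual code):

```python
RESERVED = {"id": "raw_id", "_id": "raw__id", "_additional": "raw_additional"}

def normalize_property_name(name: str) -> str:
    """Rename reserved keywords and lowercase the first letter so the name is a valid Weaviate property."""
    if name in RESERVED:
        return RESERVED[name]
    return name[0].lower() + name[1:] if name else name

assert normalize_property_name("_additional") == "raw_additional"
assert normalize_property_name("Title") == "title"
```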
## Build instructions

### Build your own connector image

This connector is built using our dynamic build process.
The base image used to build it is defined within the metadata.yaml file under the `connectorBuildOptions`.
The build logic is defined using Dagger here.
It does not rely on a Dockerfile.
If you would like to patch our connector and build your own, a simple approach would be:
- Create your own Dockerfile based on the latest version of the connector image.

```Dockerfile
FROM airbyte/destination-weaviate:latest

COPY . ./airbyte/integration_code
RUN pip install ./airbyte/integration_code

# The entrypoint and default env vars are already set in the base image
# ENV AIRBYTE_ENTRYPOINT "python /airbyte/integration_code/main.py"
# ENTRYPOINT ["python", "/airbyte/integration_code/main.py"]
```

Please use this as an example. This is not optimized.
- Build your image:

```bash
docker build -t airbyte/destination-weaviate:dev .
# Running the spec command against your patched connector
docker run airbyte/destination-weaviate:dev spec
```
### Customizing our build process

When contributing to our connector you might need to customize the build process to add a system dependency or set an env var.
You can customize our build process by adding a `build_customization.py` module to your connector. This module should contain a `pre_connector_install` and a `post_connector_install` async function that will mutate the base image and the connector container respectively. It will be imported at runtime by our build process and the functions will be called if they exist.
Here is an example of a `build_customization.py` module:

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Feel free to check the dagger documentation for more information on the Container object and its methods.
    # https://dagger-io.readthedocs.io/en/sdk-python-v0.6.4/
    from dagger import Container


async def pre_connector_install(base_image_container: Container) -> Container:
    return await base_image_container.with_env_variable("MY_PRE_BUILD_ENV_VAR", "my_pre_build_env_var_value")


async def post_connector_install(connector_container: Container) -> Container:
    return await connector_container.with_env_variable("MY_POST_BUILD_ENV_VAR", "my_post_build_env_var_value")
```
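For instance, a hypothetical `pre_connector_install` that installs a system package could look like this (a sketch assuming a Debian-based base image):

```python
async def pre_connector_install(base_image_container: Container) -> Container:
    # Install a system dependency before the connector code is installed.
    return await base_image_container.with_exec(
        ["sh", "-c", "apt-get update && apt-get install -y --no-install-recommends libpq-dev"]
    )
```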
## Changelog

| Version | Date       | Pull Request | Subject |
| :------ | :--------- | :----------- | :------ |
| 0.2.2   | 2023-10-13 | #31377       | Use our base image and remove Dockerfile |
| 0.2.1   | 2023-10-04 | #31075       | Fix OpenAI embedder batch size and conflict field name handling |
| 0.2.0   | 2023-09-22 | #30151       | Add embedding capabilities, overwrite and dedup support and API key auth mode, make certified. 🚨 Breaking changes - check migrations guide. |
| 0.1.1   | 2023-02-08 | #22527       | Multiple bug fixes: Support String based IDs, arrays of unknown type and additionalProperties of type object and array of objects |
| 0.1.0   | 2022-12-06 | #20094       | Add Weaviate destination |