# Embedding Functions
Note: This feature is available starting from version 1.2.1.
The EmbedText function integrates the embedding APIs of several providers (Amazon Bedrock, Amazon SageMaker, Cohere, Gemini, HuggingFace, Jina AI, OpenAI, and Voyage AI) to simplify converting text into vectors. It batches requests automatically for high throughput, which makes it suitable for both real-time search and batch processing.
Syntax
EmbedText(text, provider, base_url, api_key, others)
Arguments
- text (String): A non-empty string to convert into a vector.
- provider (String): The embedding model provider. Must be one of the following (case-insensitive): OpenAI, HuggingFace, Cohere, VoyageAI, Bedrock, SageMaker, Jina, Gemini.
- base_url (String): The URL of the embedding API. Optional for some providers.
- api_key (String): The API key for the embedding provider.
- others (String): Optional additional parameters for the provider's embedding API request, given as a JSON map. It can include:
  - batch_size: The maximum number of texts included in each API request. By default, it is set according to the capabilities and limits of the embedding model in use. When EmbedText operates in batch mode, it automatically groups multiple texts into batches internally before sending them to the embedding API.
  - Additional provider-specific parameters, as detailed in the provider's API documentation.
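The batching described above can be pictured with a short Python sketch. It is not EmbedText's internal implementation, which the documentation does not show; it only illustrates how a list of texts might be split into requests of at most batch_size items. The helper name split_into_batches is hypothetical.

```python
from typing import Iterator, List


def split_into_batches(texts: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each holding at most `batch_size` items.

    Hypothetical helper: it mimics the grouping the documentation describes,
    not EmbedText's internal code.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


texts = ["a", "b", "c", "d", "e"]
batches = list(split_into_batches(texts, batch_size=2))
print(batches)  # [['a', 'b'], ['c', 'd'], ['e']]
```

With batch_size set to 1, every text ends up in its own request, which is the setting the provider sections recommend for APIs without batch support.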
Returned value
- A vector converted from the input text: an array of Float32 values representing the embedding produced by the selected provider's embedding API.
- Type: Array(Float32).
# Amazon Bedrock Embedding
Setting the provider parameter to Bedrock in EmbedText uses the Amazon Bedrock Titan Embedding API for text embedding.
Provider-specific parameters
- base_url: Not applicable for this provider.
- api_key: AWS secret_access_key. Required.
- others:
  - batch_size: Not relevant, as batch embedding is not supported by this provider.
  - model: Model ID to use. Required.
  - access_key_id: AWS access_key_id. Required.
  - region_name: AWS region name. Required.
Examples
SELECT EmbedText('YOUR_TEXT', 'Bedrock', '', 'SECRET_ACCESS_KEY', '{"model":"amazon.titan-embed-text-v1", "region_name":"us-east-1", "access_key_id":"ACCESS_KEY_ID"}')
Simplified usage with custom function:
CREATE FUNCTION BedrockEmbedText ON CLUSTER '{cluster}' AS (x) -> EmbedText(x, 'Bedrock', '', 'SECRET_ACCESS_KEY', '{"model":"amazon.titan-embed-text-v1", "region_name":"us-east-1", "access_key_id":"ACCESS_KEY_ID"}')
SELECT BedrockEmbedText('YOUR_TEXT')
# Amazon SageMaker Embedding
Setting the provider parameter to SageMaker in EmbedText uses Amazon SageMaker Endpoints for text embedding.
Note: This provider is designed for models deployed on Amazon SageMaker that use a particular input and output format. The expected input for the embedding API is a JSON object in which the "input_name" key holds either a single text or a list of texts. The API response is structured as {"output_name": output}, where output is either a single embedding vector or a list of vectors, depending on whether the input is a single text or a list.
Models that meet these requirements are easy to find in SageMaker JumpStart. An example of such a model can be seen in the image below:
Provider-specific parameters
- base_url: SageMaker Endpoint name. Required.
- api_key: AWS secret_access_key. Required.
- others:
  - batch_size: Maximum number of texts in each API request. Optional. Default value is 50. Set it to 1 if the endpoint does not support batch embedding.
  - access_key_id: AWS access_key_id. Required.
  - region_name: AWS region name. Required.
  - input_name: API input name. Optional. Default value is 'text_inputs'.
  - output_name: API output name. Optional. Default value is 'embedding'.
  - model_args: Optional parameters specific to the SageMaker endpoint being used.
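To make the expected request and response shapes concrete, here is a minimal Python sketch of the format described in the note, using the default names 'text_inputs' and 'embedding'. The function names are hypothetical, and placing model_args entries at the top level of the request body is an assumption; the documentation does not state how EmbedText merges them.

```python
import json
from typing import Any, Dict, List, Optional, Union

Vector = List[float]


def build_request_body(
    texts: Union[str, List[str]],
    input_name: str = "text_inputs",
    model_args: Optional[Dict[str, Any]] = None,
) -> str:
    """Serialize a body of the form {input_name: texts} plus any model_args.

    Assumption: model_args entries are merged at the top level of the body.
    """
    body: Dict[str, Any] = dict(model_args or {})
    body[input_name] = texts
    return json.dumps(body)


def parse_response_body(raw: str, output_name: str = "embedding") -> List[Vector]:
    """Read {output_name: output} and always return a list of vectors."""
    output = json.loads(raw)[output_name]
    # A single text yields one flat vector; a list of texts yields a list of vectors.
    if output and isinstance(output[0], (int, float)):
        return [output]
    return output


body = build_request_body(["hello", "world"], model_args={"mode": "embedding"})
print(body)

# A made-up response, only to exercise the parser.
fake_response = json.dumps({"embedding": [[0.1, 0.2], [0.3, 0.4]]})
print(parse_response_body(fake_response))
```

An endpoint that uses different key names would be handled by passing other values for input_name and output_name, mirroring the parameters of the same name in others.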
Examples
Using Default Values:
SELECT EmbedText('YOUR_TEXT', 'SageMaker', 'SAGEMAKER_ENDPOINT', 'SECRET_ACCESS_KEY', '{"region_name":"us-east-1", "access_key_id":"ACCESS_KEY_ID", "model_args":{"mode":"embedding"}}')
Using Custom Values:
SELECT EmbedText('YOUR_TEXT', 'SageMaker', 'SAGEMAKER_ENDPOINT', 'SECRET_ACCESS_KEY', '{"region_name":"us-east-1", "access_key_id":"ACCESS_KEY_ID", "model_args":{"mode":"embedding"}, "input_name":"inputs", "output_name":"embedding"}')
Simplified usage with custom function:
CREATE FUNCTION SageMakerEmbedText ON CLUSTER '{cluster}' AS (x) -> EmbedText(x, 'SageMaker', 'SAGEMAKER_ENDPOINT', 'SECRET_ACCESS_KEY', '{"region_name":"us-east-1", "access_key_id":"ACCESS_KEY_ID", "model_args":{"mode":"embedding"}}')
SELECT SageMakerEmbedText('YOUR_TEXT')
# Cohere Embedding
Setting the provider parameter to Cohere in EmbedText uses the Cohere Embedding API for text embedding.
Provider-specific parameters
- base_url: Cohere Embedding API URL. Optional. Default value is https://api.cohere.ai/v1/embed.
- api_key: Cohere API Key. Required.
- others:
  - batch_size: Maximum number of texts in each API request. Optional. Default value is 50.
  - model: Model ID to use. Optional. Default value is embed-english-v2.0.
  - input_type: The type of input text. Optional.
  - truncate: Optional. One of NONE|START|END, specifying how the API handles inputs longer than the maximum token length.
Examples
Using Default Values:
SELECT EmbedText('YOUR_TEXT', 'Cohere', '', 'COHERE_API_KEY', '')
Using Custom Values:
SELECT EmbedText('YOUR_TEXT', 'Cohere', 'YOUR_EMBEDDING_API_URL', 'COHERE_API_KEY', '{"model":"YOUR_MODEL_ID", "batch_size":YOUR_BATCH_SIZE, "input_type":"search_query", "truncate":"END"}')
Simplified usage with custom function:
CREATE FUNCTION CohereEmbedText ON CLUSTER '{cluster}' AS (x) -> EmbedText(x, 'Cohere', '', 'COHERE_API_KEY', '')
SELECT CohereEmbedText('YOUR_TEXT')
# Gemini Embedding
Setting the provider parameter to Gemini in EmbedText uses the Gemini Embedding API for text embedding.
Provider-specific parameters
- base_url: Gemini Embedding API URL. Optional. Default value is https://generativelanguage.googleapis.com/v1beta.
- api_key: Gemini API Key. Required.
- others:
  - batch_size: Maximum number of texts in each API request. Optional. Default value is 50.
  - model: Model ID to use. Optional. Default value is models/embedding-001.
Examples
Using Default Values:
SELECT EmbedText('YOUR_TEXT', 'Gemini', '', 'GEMINI_API_KEY', '')
Using Custom Values:
SELECT EmbedText('YOUR_TEXT', 'Gemini', 'YOUR_EMBEDDING_API_URL', 'GEMINI_API_KEY', '{"model":"YOUR_MODEL_ID", "batch_size":YOUR_BATCH_SIZE}')
Simplified usage with custom function:
CREATE FUNCTION GeminiEmbedText ON CLUSTER '{cluster}' AS (x) -> EmbedText(x, 'Gemini', '', 'GEMINI_API_KEY', '')
SELECT GeminiEmbedText('YOUR_TEXT')
# HuggingFace Embedding
Setting the provider parameter to HuggingFace in EmbedText uses the HuggingFace Inference API/Inference Endpoint for text embedding.
Note: This provider is compatible only with APIs that follow a certain input and output format, such as the BAAI/BGE embedding APIs. The expected input for the embedding API is a JSON object with "inputs" as either a single text or a list of texts. The response is an embedding vector or a list of embedding vectors, depending on the input provided. If the API does not support batch embedding, set batch_size to 1 in the others parameter.
Provider-specific parameters
- base_url: HuggingFace Embedding API URL. Required.
- api_key: HuggingFace API Key. Required.
- others:
  - batch_size: Maximum number of texts in each API request. Optional. Default value is 32.
  - model_args: Optional parameters specific to the HuggingFace model being used.
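The input and output format from the note can be sketched in Python as follows. Unlike the SageMaker format, the response here is a bare vector or list of vectors rather than a named field. The function names are hypothetical, and putting model_args keys next to "inputs" at the top level of the body is an assumption suggested by the {"parameters": ...} shape used in the examples, not something the documentation confirms.

```python
import json
from typing import Any, Dict, List, Optional, Union

Vector = List[float]


def build_hf_payload(
    texts: Union[str, List[str]],
    model_args: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build a request body with "inputs" plus any model_args.

    Assumption: model_args keys sit next to "inputs" at the top level.
    """
    payload: Dict[str, Any] = {"inputs": texts}
    payload.update(model_args or {})
    return payload


def normalize_hf_response(raw: str) -> List[Vector]:
    """Return a list of vectors whether the API sent one vector or several."""
    data = json.loads(raw)
    if data and not isinstance(data[0], list):
        return [data]  # single text in, single flat vector out
    return data


payload = build_hf_payload(
    ["first text", "second text"],
    {"parameters": {"truncation": True}},
)
print(json.dumps(payload))
print(normalize_hf_response("[0.5, 0.25]"))
```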
Examples
Using Default Values:
SELECT EmbedText('YOUR_TEXT', 'HuggingFace', 'API_URL', 'HUGGINGFACE_API_KEY', '')
Using Custom Values:
SELECT EmbedText('YOUR_TEXT', 'HuggingFace', 'API_URL', 'HUGGINGFACE_API_KEY', '{"model_args":{"parameters": {"truncation":true}}}')
Simplified usage with custom function:
CREATE FUNCTION HuggingFaceEmbedText ON CLUSTER '{cluster}' AS (x) -> EmbedText(x, 'HuggingFace', 'API_URL', 'HUGGINGFACE_API_KEY', '')
SELECT HuggingFaceEmbedText('YOUR_TEXT')
# Jina AI Embedding
Setting the provider parameter to Jina in EmbedText uses the Jina AI Embedding API for text embedding.
Provider-specific parameters
- base_url: Jina AI Embedding API URL. Optional. Default value is https://api.jina.ai/v1/embeddings.
- api_key: Jina AI API Key. Required.
- others:
  - batch_size: Maximum number of texts in each API request. Optional. Default value is 50.
  - model: Model ID to use. Optional. Default value is jina-embeddings-v2-base-en.
Examples
Using Default Values:
SELECT EmbedText('YOUR_TEXT', 'Jina', '', 'JINAAI_API_KEY', '')
Using Custom Values:
SELECT EmbedText('YOUR_TEXT', 'Jina', 'YOUR_EMBEDDING_API_URL', 'JINAAI_API_KEY', '{"model":"YOUR_MODEL_ID", "batch_size":YOUR_BATCH_SIZE}')
Simplified usage with custom function:
CREATE FUNCTION JinaAIEmbedText ON CLUSTER '{cluster}' AS (x) -> EmbedText(x, 'Jina', '', 'JINAAI_API_KEY', '')
SELECT JinaAIEmbedText('YOUR_TEXT')
# OpenAI Embedding
Setting the provider parameter to OpenAI in EmbedText uses the OpenAI Embedding API for text embedding.
Provider-specific parameters
- base_url: OpenAI Embedding API URL. Optional. Default value is https://api.openai.com/v1/embeddings.
- api_key: OpenAI API Key. Required.
- others:
  - batch_size: Maximum number of texts in each API request. Optional. Default value is 50.
  - model: Model ID to use. Optional. Supported models include text-embedding-ada-002, text-embedding-3-small, and text-embedding-3-large. Default value is text-embedding-ada-002 for versions prior to 1.3.0, and text-embedding-3-small starting from version 1.3.0.
  - dimensions: The number of dimensions the output embeddings should have. Optional. Available since version 1.3.0.
  - user: A unique identifier for your end user, which helps OpenAI monitor and detect abuse. Optional.
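When the query is assembled from application code, the others argument can be generated rather than typed by hand. The Python sketch below builds the JSON map from the parameters listed above and drops any that are unset. The helper name build_openai_others and the value 256 for dimensions are illustrative assumptions, not values taken from this documentation.

```python
import json
from typing import Optional


def build_openai_others(
    model: str = "text-embedding-3-small",
    dimensions: Optional[int] = None,
    batch_size: Optional[int] = None,
    user: Optional[str] = None,
) -> str:
    """Return the JSON map passed as the `others` argument, omitting unset options."""
    options = {
        "model": model,
        "dimensions": dimensions,
        "batch_size": batch_size,
        "user": user,
    }
    return json.dumps({k: v for k, v in options.items() if v is not None})


others = build_openai_others(dimensions=256)
# The resulting string becomes the fifth argument of EmbedText.
sql = f"SELECT EmbedText('YOUR_TEXT', 'OpenAI', '', 'OPENAI_API_KEY', '{others}')"
print(sql)
```

The JSON produced here contains no single quotes, so it can be placed inside the SQL string literal as is; option values that do contain single quotes would need escaping first.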
Examples
Using Default Values:
SELECT EmbedText('YOUR_TEXT', 'OpenAI', '', 'OPENAI_API_KEY', '')
Using Custom Values:
SELECT EmbedText('YOUR_TEXT', 'OpenAI', 'YOUR_EMBEDDING_API_URL', 'OPENAI_API_KEY', '{"model":"YOUR_MODEL_ID", "batch_size":YOUR_BATCH_SIZE, "user":"YOUR_USER_ID"}')
Simplified usage with custom function:
CREATE FUNCTION OpenAIEmbedText ON CLUSTER '{cluster}' AS (x) -> EmbedText(x, 'OpenAI', '', 'OPENAI_API_KEY', '')
SELECT OpenAIEmbedText('YOUR_TEXT')
# Voyage AI Embedding
Setting the provider parameter to VoyageAI in EmbedText uses the Voyage AI Embedding API for text embedding.
Provider-specific parameters
- base_url: Voyage AI Embedding API URL. Optional. Default value is https://api.voyageai.com/v1/embeddings.
- api_key: Voyage AI API Key. Required.
- others:
  - batch_size: Maximum number of texts in each API request. Optional. Default value is 8.
  - model: Model ID to use. Optional. Default value is voyage-01.
Examples
Using Default Values:
SELECT EmbedText('YOUR_TEXT', 'VoyageAI', '', 'VOYAGEAI_API_KEY', '')
Using Custom Values:
SELECT EmbedText('YOUR_TEXT', 'VoyageAI', 'YOUR_EMBEDDING_API_URL', 'VOYAGEAI_API_KEY', '{"model":"YOUR_MODEL_ID", "batch_size":YOUR_BATCH_SIZE}')
Simplified usage with custom function:
CREATE FUNCTION VoyageAIEmbedText ON CLUSTER '{cluster}' AS (x) -> EmbedText(x, 'VoyageAI', '', 'VOYAGEAI_API_KEY', '')
SELECT VoyageAIEmbedText('YOUR_TEXT')