# Full-Text Search
NOTE
This guide is only applicable to the DB version 1.5.0 or higher.
Traditional database textual search capabilities with operators like LIKE
and ILIKE
, have been fundamental for years. However, they fall short in addressing the needs of modern information retrieval systems due to several limitations:
- Lack of Linguistic Support: Traditional methods struggle with language nuances, failing to recognize word derivations (e.g., "satisfies" vs. "satisfy"), which can lead to incomplete or inaccurate search results. Although it's possible to manually search for variations using
OR
, this approach is cumbersome and error-prone. - Absence of Result Ranking: Without the ability to rank search results, sifting through thousands of matches becomes inefficient.
- Performance Issues: The lack of index support means every document must be processed for each search, leading to slow performance.
To overcome these challenges, MyScale introduces a new index type called FTS Index (Full-Text Search Index), powered by the Tantivy (opens new window) library—a high-performance, open-source full-text search engine library. The FTS Index supports the BM25 indexing algorithm, enabling efficient and relevant search results. This integration enhances MyScale's full-text search functionality and improves overall performance.
# Tutorial Overview
This tutorial guides you through conducting three types of search experiments with FTS indexes:
- FTS Index Creation: Learn how to configure FTS indexes for various use cases.
- Using the
TextSearch()
Function: Discover how to search and rank texts with BM25 scores. - Natural Language Queries: Enhance search logic with natural language connectives like
AND
andOR
. - String Search Functions in MyScale: Explore how MyScale leverages FST indexes for improved string search capabilities.
Before starting, ensure you have a MyScale cluster set up. For setup instructions, refer to our Quickstart Guide (opens new window).
# Dataset Overview
We'll use the Wikipedia abstract dataset (opens new window), containing over 5.6 million records, available in the Parquet format. This dataset will be directly imported into MyScale from S3, eliminating the need for local downloads.
The table below briefly describes the content of this dataset.
id | body | title | url |
---|---|---|---|
... | ... | ... | ... |
77 | Jake Rodkin is an American .... and Puzzle Agent. | Jake Rodkin | https://en.wikipedia.org/wiki/Jake_Rodkin (opens new window) |
78 | Friedlandpreis der Heimkehrer is ... of Germany. | Friedlandpreis der Heimkehrer | https://en.wikipedia.org/wiki/Friedlandpreis_der_Heimkehrer (opens new window) |
... | ... | ... | ... |
# Creating and Populating the Table
Create the en_wiki_abstract
table in MyScale using the following SQL command:
CREATE TABLE default.en_wiki_abstract(
`id` UInt64,
`body` String,
`title` String,
`url` String,
)
ENGINE = MergeTree
ORDER BY id;
Then, import the dataset from S3. Please wait patiently for the data import to complete.
INSERT INTO default.en_wiki_abstract SELECT * FROM s3('https://myscale-datasets.s3.ap-southeast-1.amazonaws.com/wiki_abstract_5m.parquet','Parquet');
Verify that the table contains 5,648,453 rows of data.
SELECT count(*) FROM default.en_wiki_abstract;
Output:
count() |
---|
5648453 |
To improve search performance, we can optimize the table by consolidating it into a single data part. This step is optional.
OPTIMIZE TABLE default.en_wiki_abstract FINAL;
Run the following SQL statement to check if the data in this table has been compressed into one part.
SELECT COUNT(*) FROM system.parts
WHERE table = 'en_wiki_abstract' AND active = 1;
If the data is compressed to 1, this SQL statement will return the following result set:
count() |
---|
1 |
# Understanding FTS Index Parameters
MyScale supports various tokenizers, each suited for different scenarios. When creating an FTS index, you can customize it with a JSON configuration for the tokenizer.
TIP
Please provide a valid JSON string in the parameter of FTS index.
-- Example 1: Create an index without specifying parameters, using default configuration
ALTER TABLE [table_name] ADD INDEX [index_name] [column_name]
TYPE fts;
-- Example 2: Create an index using the default tokenizer
ALTER TABLE [table_name] ADD INDEX [index_name] [column_name]
TYPE fts('{"[column_name]":{"tokenizer":{"type":"default"}}}');
-- Example 3: Use the stem tokenizer and apply stop words
ALTER TABLE [table_name] ADD INDEX [index_name] [column_name]
TYPE fts('{"[column_name]":{"tokenizer":{"type":"stem", "stop_word_filters":["english"]}}}');
The table below lists the types of tokenizers supported by the FTS index.
Type of Tokenizer | Description |
---|---|
default | Default tokenizer, splits text on non-alphabetic characters, case-insensitive |
raw | Raw tokenizer, performs no tokenization on the text, treats the entire text as a single token |
simple | Simple tokenizer, splits text on non-alphabetic characters |
stem | Stemming tokenizer, supports multiple languages, converts words to their stem form, can ignore word tenses |
whitespace | Whitespace tokenizer, splits text on whitespace characters (spaces, tabs, newlines, etc.) |
ngram | N-gram tokenizer, splits text based on specified n-gram length |
chinese | Chinese tokenizer, performs tokenization on Chinese text, internally uses the jieba tokenization library |
# Common Tokenizer Parameters
The raw
and stem
tokenizers only support the store_doc
parameter, while the other tokenizers support following common parameters.
Parameter Name | Type | Default Value | Description |
---|---|---|---|
store_doc | boolean | false | Whether to store the original document, currently not recommended to enable |
length_limit | number | 40 | Maximum length of tokenized tokens |
case_sensitive | boolean | false | Whether tokenization is case-sensitive |
Most tokenizers support additional parameters in addition to the common ones mentioned above.
# simple
、stem
、whitespace
tokenizers
Parameter Name | Type | Default Value | Description |
---|---|---|---|
stop_word_filters | string[] | [] | Stop word filters, specifies which languages to discard stop words during tokenization, all valid languages include ["danish", "dutch", "english", "finnish", "french", "german", "hungarian", "italian", "norwegian", "portuguese", "russian", "spanish", "swedish"] |
# stem
tokenizer
Parameter Name | Type | Default Value | Description |
---|---|---|---|
stem_languages | string[] | [] | Languages used for stemming, for English, it can ignore word tenses during tokenization, supported languages for the stem tokenizer are ["arabic", "danish", "dutch", "english", "finnish", "french", "german", "greek", "hungarian", "italian", "norwegian", "portuguese", "romanian", "russian", "spanish", "swedish", "tamil", "turkish"] |
# ngram
tokenizer
Parameter Name | Type | Default Value | Description |
---|---|---|---|
min_gram | number | 2 | Minimum number of grams |
max_gram | number | 3 | Maximum number of grams |
prefix_only | boolean | false | Whether to only extract n-grams from the prefix of words |
# chinese
tokenizer
For the Chinese tokenizer, cang-jie (opens new window) is used as the underlying implementation.
Parameter Name | Type | Default Value | Description |
---|---|---|---|
jieba | string | "default" | "default" means using the jieba dictionary, "empty" means not using the built-in jieba dictionary for tokenization. Valid values are "default" or "empty" |
mode | string | "search" | Chinese tokenization mode, valid values are "all", "default", "search", or "unicode", the differences between each mode can be referred to cang-jie/options.rs (opens new window) |
hmm | boolean | false | Whether to enable HMM |
The above provides a detailed description of the parameters for MyScale FTS index tokenizers. When creating an FTS index, you can select the tokenizer type that best suits your needs and configure its parameters to enhance search performance and results.
# Creating an FTS Index
Customize your FTS index with the appropriate tokenizer to optimize search performance. For example, using the stem
tokenizer with English stop words can improve search accuracy by focusing on the root form of words.
ALTER TABLE default.en_wiki_abstract
ADD INDEX body_idx (body)
TYPE fts('{"body":{"tokenizer":{"type":"stem", "stop_word_filters":["english"]}}}');
Normally the FTS index (similar to skip indexes in ClickHouse) are only applied on newly inserted data, so just adding the index won't affect existing data. To index already existing data, use this statement:
ALTER TABLE default.en_wiki_abstract MATERIALIZE INDEX body_idx;
# Searching with Document BM25 Ranking
Note
The first execution of TextSearch()
might be slower because it needs to load the FTS index.
The following example shows how to use the TextSearch()
function. The example returns the top 10 most relevant documents to "non-profit institute in Washington". The metric for measuring relevance is the BM25 score returned by the TextSearch()
function - the higher the score, the more relevant it is.
SELECT
id,
title,
body,
TextSearch(body, 'non-profit institute in Washington') AS score
FROM default.en_wiki_abstract
ORDER BY score DESC
LIMIT 5;
From the results, we can see that the text in each row of the body column is related to the search phrase "non-profit institute in Washington".
id | title | body | score |
---|---|---|---|
3400768 | Drug Strategies | Drug Strategies is a non-profit research institute located in Washington D.C. | 24.457561 |
872513 | Earth Policy Institute | Earth Policy Institute was an independent non-profit environmental organization based in Washington, D.C. | 22.730673 |
895248 | Arab American Institute | Founded in 1985, the Arab American Institute is a non-profit membership organization based in Washington D.C. | 21.955559 |
1950599 | Environmental Law Institute | The Environmental Law Institute (ELI) is a non-profit, non-partisan organization, headquartered in Washington, D.C. | 21.231567 |
2351478 | Public Knowledge | Public Knowledge is a non-profit Washington, D.C. | 20.742344 |
# Leveraging Natural Language Queries
MyScale employs the Tantivy library for full-text search (FTS) indexing, enabling support for complex natural language queries. For further information, refer to the official documentation (opens new window).
Here is an example of a multi-condition combination query using AND
and OR
. We want to search for Anthropological topics related to New York City, Africa, or Paris. The SQL query would be:
SELECT
id,
title,
body,
TextSearch(body, 'Anthropological AND ("New York City" OR African OR Paris)') AS score
FROM default.en_wiki_abstract
ORDER BY score DESC
LIMIT 5;
The search results show that each line of text contains the word "Anthropological" (case-insensitive), satisfying the condition on the left side of the AND statement. At the same time, each result contains at least one of "New York City", "African" or "Paris", meeting the matching condition on the right side of the AND statement.
id | title | body | score |
---|---|---|---|
2826913 | African Anthropologist | African Anthropologist is the journal of the Pan African Anthropological Association (PAAA). | 20.131313 |
3905943 | Tunnel People | Tunnel People is an anthropological-journalistic account describing an underground homeless community in New York City. It is written by war photographer and anthropologist Teun Voeten and was published by PM Press in 2010. | 13.759308 |
3790627 | Les Accords de Bella | Les Accords de Bella is a 2007 anthropological documentary film directed by David Constantin. It was selected by the African Film Festival of Cordoba - FCAT. | 12.769518 |
4488199 | Naparay | Naparay, in African anthropological study, is a non-linear conception of human life held by some West African peoples such as the Yoruba. Similar to reincarnation, naparay holds that lives are cyclic and attributes of previous lives may carry over to a new life. | 11.682068 |
1725559 | Gradhiva | Gradhiva is an anthropological and museological journal, founded in 1986 by the poet and social scientist Michel Leiris and by the anthropologist Jean Jamin. It is since 2005 published by the Musée du Quai Branly in Paris. | 11.135916 |
# Utilizing SkipIndex for Searching in Strings
When using the following ClickHouse built-in functions for string searches, they will all be filtered through the FTS index: equals
, notEquals
, like
, notLike
, hasToken
, hasTokenOrNull
, in
, notIn
,startsWith
,endsWith
,has
,mapContains
,multiSearchAny
.
Example - 1 Perform a simple keyword search, such as searching for the single word 'Tsinghua'.
SELECT count(*)
FROM default.en_wiki_abstract
WHERE hasToken(body, 'Tsinghua');
Output:
count() |
---|
81 |
Example - 2 Perform a multi-keyword search, finding articles that contain Eiffel Tower but not Paris.
SELECT count(*)
FROM default.en_wiki_abstract
WHERE (NOT hasToken(body, 'Paris')) AND multiSearchAny(body, ['Eiffel', 'Tower'])
Output:
count() |
---|
2828 |
Example - 3 Perform a complex conditional combination query, combining different names, places, disciplines and other terms.
SELECT count(*)
FROM default.en_wiki_abstract
WHERE (NOT multiSearchAny(body, ['Montessori', 'postulated', 'Rooney'])) AND (hasToken(body, 'Patsy') OR hasToken(body, 'Anthropological'))
Output:
count() |
---|
204 |
# Conclusion
This guide has walked you through leveraging MyScale for advanced text search capabilities, from setting up FTS indexes to executing natural language queries. By applying these techniques, you can efficiently manage and search through unstructured text data, showcasing MyScale's robust text processing capabilities.