# Full-Text Search

NOTE

This guide is only applicable to the DB version 1.5.0 or higher.

Traditional database textual search capabilities with operators like LIKE and ILIKE, have been fundamental for years. However, they fall short in addressing the needs of modern information retrieval systems due to several limitations:

  • Lack of Linguistic Support: Traditional methods struggle with language nuances, failing to recognize word derivations (e.g., "satisfies" vs. "satisfy"), which can lead to incomplete or inaccurate search results. Although it's possible to manually search for variations using OR, this approach is cumbersome and error-prone.
  • Absence of Result Ranking: Without the ability to rank search results, sifting through thousands of matches becomes inefficient.
  • Performance Issues: The lack of index support means every document must be processed for each search, leading to slow performance.

To overcome these challenges, MyScale introduces a new index type called FTS Index (Full-Text Search Index), powered by the Tantivy (opens new window) library—a high-performance, open-source full-text search engine library. The FTS Index supports the BM25 indexing algorithm, enabling efficient and relevant search results. This integration enhances MyScale's full-text search functionality and improves overall performance.

# Tutorial Overview

This tutorial guides you through conducting three types of search experiments with FTS indexes:

Illustration of FTS Index in MyScale

Before starting, ensure you have a MyScale cluster set up. For setup instructions, refer to our Quickstart Guide (opens new window).

# Dataset Overview

We'll use the Wikipedia abstract dataset (opens new window), containing over 5.6 million records, available in the Parquet format. This dataset will be directly imported into MyScale from S3, eliminating the need for local downloads.

The table below briefly describes the content of this dataset.

id body title url
... ... ... ...
77 Jake Rodkin is an American .... and Puzzle Agent. Jake Rodkin https://en.wikipedia.org/wiki/Jake_Rodkin (opens new window)
78 Friedlandpreis der Heimkehrer is ... of Germany. Friedlandpreis der Heimkehrer https://en.wikipedia.org/wiki/Friedlandpreis_der_Heimkehrer (opens new window)
... ... ... ...

# Creating and Populating the Table

Create the en_wiki_abstract table in MyScale using the following SQL command:

CREATE TABLE default.en_wiki_abstract(
    `id` UInt64,
    `body` String,
    `title` String,
    `url` String,
)
ENGINE = MergeTree
ORDER BY id;

Then, import the dataset from S3. Please wait patiently for the data import to complete.

INSERT INTO default.en_wiki_abstract SELECT * FROM s3('https://myscale-datasets.s3.ap-southeast-1.amazonaws.com/wiki_abstract_5m.parquet','Parquet');

Verify that the table contains 5,648,453 rows of data.

SELECT count(*) FROM default.en_wiki_abstract;

Output:

count()
5648453

To improve search performance, we can optimize the table by consolidating it into a single data part. This step is optional.

OPTIMIZE TABLE default.en_wiki_abstract FINAL;

Run the following SQL statement to check if the data in this table has been compressed into one part.

SELECT COUNT(*) FROM system.parts
WHERE table = 'en_wiki_abstract' AND active = 1;

If the data is compressed to 1, this SQL statement will return the following result set:

count()
1

# Understanding FTS Index Parameters

MyScale supports various tokenizers, each suited for different scenarios. When creating an FTS index, you can customize it with a JSON configuration for the tokenizer.

TIP

Please provide a valid JSON string in the parameter of FTS index.

-- Example 1: Create an index without specifying parameters, using default configuration
ALTER TABLE [table_name] ADD INDEX [index_name] [column_name]
TYPE fts;
-- Example 2: Create an index using the default tokenizer
ALTER TABLE [table_name] ADD INDEX [index_name] [column_name]
TYPE fts('{"[column_name]":{"tokenizer":{"type":"default"}}}');
-- Example 3: Use the stem tokenizer and apply stop words
ALTER TABLE [table_name] ADD INDEX [index_name] [column_name]
TYPE fts('{"[column_name]":{"tokenizer":{"type":"stem", "stop_word_filters":["english"]}}}');

The table below lists the types of tokenizers supported by the FTS index.

Type of Tokenizer Description
default Default tokenizer, splits text on non-alphabetic characters, case-insensitive
raw Raw tokenizer, performs no tokenization on the text, treats the entire text as a single token
simple Simple tokenizer, splits text on non-alphabetic characters
stem Stemming tokenizer, supports multiple languages, converts words to their stem form, can ignore word tenses
whitespace Whitespace tokenizer, splits text on whitespace characters (spaces, tabs, newlines, etc.)
ngram N-gram tokenizer, splits text based on specified n-gram length
chinese Chinese tokenizer, performs tokenization on Chinese text, internally uses the jieba tokenization library

# Common Tokenizer Parameters

The raw and stem tokenizers only support the store_doc parameter, while the other tokenizers support following common parameters.

Parameter Name Type Default Value Description
store_doc boolean false Whether to store the original document, currently not recommended to enable
length_limit number 40 Maximum length of tokenized tokens
case_sensitive boolean false Whether tokenization is case-sensitive

Most tokenizers support additional parameters in addition to the common ones mentioned above.

# simplestemwhitespace tokenizers

Parameter Name Type Default Value Description
stop_word_filters string[] [] Stop word filters, specifies which languages to discard stop words during tokenization, all valid languages include ["danish", "dutch", "english", "finnish", "french", "german", "hungarian", "italian", "norwegian", "portuguese", "russian", "spanish", "swedish"]

# stem tokenizer

Parameter Name Type Default Value Description
stem_languages string[] [] Languages used for stemming, for English, it can ignore word tenses during tokenization, supported languages for the stem tokenizer are ["arabic", "danish", "dutch", "english", "finnish", "french", "german", "greek", "hungarian", "italian", "norwegian", "portuguese", "romanian", "russian", "spanish", "swedish", "tamil", "turkish"]

# ngram tokenizer

Parameter Name Type Default Value Description
min_gram number 2 Minimum number of grams
max_gram number 3 Maximum number of grams
prefix_only boolean false Whether to only extract n-grams from the prefix of words

# chinese tokenizer

For the Chinese tokenizer, cang-jie (opens new window) is used as the underlying implementation.

Parameter Name Type Default Value Description
jieba string "default" "default" means using the jieba dictionary, "empty" means not using the built-in jieba dictionary for tokenization. Valid values are "default" or "empty"
mode string "search" Chinese tokenization mode, valid values are "all", "default", "search", or "unicode", the differences between each mode can be referred to cang-jie/options.rs (opens new window)
hmm boolean false Whether to enable HMM

The above provides a detailed description of the parameters for MyScale FTS index tokenizers. When creating an FTS index, you can select the tokenizer type that best suits your needs and configure its parameters to enhance search performance and results.

# Creating an FTS Index

Customize your FTS index with the appropriate tokenizer to optimize search performance. For example, using the stem tokenizer with English stop words can improve search accuracy by focusing on the root form of words.

ALTER TABLE default.en_wiki_abstract
ADD INDEX body_idx (body)
TYPE fts('{"body":{"tokenizer":{"type":"stem", "stop_word_filters":["english"]}}}');

Normally the FTS index (similar to skip indexes in ClickHouse) are only applied on newly inserted data, so just adding the index won't affect existing data. To index already existing data, use this statement:

ALTER TABLE default.en_wiki_abstract MATERIALIZE INDEX body_idx;

# Searching with Document BM25 Ranking

Note

The first execution of TextSearch() might be slower because it needs to load the FTS index.

The following example shows how to use the TextSearch() function. The example returns the top 10 most relevant documents to "non-profit institute in Washington". The metric for measuring relevance is the BM25 score returned by the TextSearch() function - the higher the score, the more relevant it is.

SELECT
    id,
    title,
    body,
    TextSearch(body, 'non-profit institute in Washington') AS score
FROM default.en_wiki_abstract
ORDER BY score DESC
LIMIT 5;

From the results, we can see that the text in each row of the body column is related to the search phrase "non-profit institute in Washington".

id title body score
3400768 Drug Strategies Drug Strategies is a non-profit research institute located in Washington D.C. 24.457561
872513 Earth Policy Institute Earth Policy Institute was an independent non-profit environmental organization based in Washington, D.C. 22.730673
895248 Arab American Institute Founded in 1985, the Arab American Institute is a non-profit membership organization based in Washington D.C. 21.955559
1950599 Environmental Law Institute The Environmental Law Institute (ELI) is a non-profit, non-partisan organization, headquartered in Washington, D.C. 21.231567
2351478 Public Knowledge Public Knowledge is a non-profit Washington, D.C. 20.742344

# Leveraging Natural Language Queries

MyScale employs the Tantivy library for full-text search (FTS) indexing, enabling support for complex natural language queries. For further information, refer to the official documentation (opens new window).

Here is an example of a multi-condition combination query using AND and OR. We want to search for Anthropological topics related to New York City, Africa, or Paris. The SQL query would be:

SELECT
    id,
    title,
    body,
    TextSearch(body, 'Anthropological AND ("New York City" OR African OR Paris)') AS score
FROM default.en_wiki_abstract
ORDER BY score DESC
LIMIT 5;

The search results show that each line of text contains the word "Anthropological" (case-insensitive), satisfying the condition on the left side of the AND statement. At the same time, each result contains at least one of "New York City", "African" or "Paris", meeting the matching condition on the right side of the AND statement.

id title body score
2826913 African Anthropologist African Anthropologist is the journal of the Pan African Anthropological Association (PAAA). 20.131313
3905943 Tunnel People Tunnel People is an anthropological-journalistic account describing an underground homeless community in New York City. It is written by war photographer and anthropologist Teun Voeten and was published by PM Press in 2010. 13.759308
3790627 Les Accords de Bella Les Accords de Bella is a 2007 anthropological documentary film directed by David Constantin. It was selected by the African Film Festival of Cordoba - FCAT. 12.769518
4488199 Naparay Naparay, in African anthropological study, is a non-linear conception of human life held by some West African peoples such as the Yoruba. Similar to reincarnation, naparay holds that lives are cyclic and attributes of previous lives may carry over to a new life. 11.682068
1725559 Gradhiva Gradhiva is an anthropological and museological journal, founded in 1986 by the poet and social scientist Michel Leiris and by the anthropologist Jean Jamin. It is since 2005 published by the Musée du Quai Branly in Paris. 11.135916

# Utilizing SkipIndex for Searching in Strings

When using the following ClickHouse built-in functions for string searches, they will all be filtered through the FTS index: equals, notEquals, like, notLike, hasToken, hasTokenOrNull, in, notIn,startsWith,endsWith,has,mapContains,multiSearchAny.

Example - 1 Perform a simple keyword search, such as searching for the single word 'Tsinghua'.

SELECT count(*)
FROM default.en_wiki_abstract
WHERE hasToken(body, 'Tsinghua');

Output:

count()
81

Example - 2 Perform a multi-keyword search, finding articles that contain Eiffel Tower but not Paris.

SELECT count(*)
FROM default.en_wiki_abstract
WHERE (NOT hasToken(body, 'Paris')) AND multiSearchAny(body, ['Eiffel', 'Tower'])

Output:

count()
2828

Example - 3 Perform a complex conditional combination query, combining different names, places, disciplines and other terms.

SELECT count(*)
FROM default.en_wiki_abstract
WHERE (NOT multiSearchAny(body, ['Montessori', 'postulated', 'Rooney'])) AND (hasToken(body, 'Patsy') OR hasToken(body, 'Anthropological'))

Output:

count()
204

# Conclusion

This guide has walked you through leveraging MyScale for advanced text search capabilities, from setting up FTS indexes to executing natural language queries. By applying these techniques, you can efficiently manage and search through unstructured text data, showcasing MyScale's robust text processing capabilities.

Last Updated: Mon Apr 29 2024 08:06:28 GMT+0000