# Quickstart

This guide will show you how to launch a cluster, import data, and execute SQL queries in just a few simple steps. For more information on other developer tools, see the Developer Tools for more information.

# Launching a Cluster

Launch a first cluster

Before you can perform any data operations, you need to launch a cluster. The Clusters page lets you configure computing and storage resources to meet your requirements. Follow these steps to launch a new cluster:

  1. Go to the Clusters page and click the "+ New Cluster" to launch a new cluster.
  2. Enter a preferred cluster name, such as "default".
  3. The current version of MyScale is limited to the default configuration and does not support multiple replicas. If you need a more powerful configuration, please refer to the Support & Help section and get in touch with us.
  4. Click 'Launch' to complete the cluster launch.

# Environment Setup

# Python

Before getting started with python, you need to install Clickhouse client (opens new window).

$ pip install clickhouse-connect

To connect to your cluster with python, you'll need to provide your cluster host, username, and password. Once the cluster creation is complete, you can find the Connection Details button in the Action column of your cluster list.

Actions

After clicking on the Connection Details button, you will see the necessary code required to access MyScale. Simply click on the copy icon to copy the corresponding code.

Python

For more guidance on obtaining the necessary information to connect to the cluster, please refer to the Connection Details.

# MyScale Console

Navigate to the "SQL Workspace" page. Your cluster will be selected automatically, as shown below:

Enter the SQL Workspace page

# Importing Data

To import data into MyScale, follow these three steps:

  1. Create a table
  2. Insert data into the table
  3. Build a vector index

# Creating a Table

Tables must be created in MyScale before you can import data. To learn more, see the "Create Database" and "Create Table" sub-sections in the SQL Reference section.

Create a new table named default.myscale_categorical_search.

  • Python
  • SQL
# Create a table with 128 dimensional vectors.
client.command("""
CREATE TABLE default.myscale_categorical_search
(
    id    UInt32,
    data  Array(Float32),
    CONSTRAINT check_length CHECK length(data) = 128,
    date  Date,
    label Enum8('person' = 1, 'building' = 2, 'animal' = 3)
)
ENGINE = MergeTree ORDER BY id""")

# Inserting Data

MyScale currently supports data import from AWS S3 and other cloud services with S3-compatible APIs. Find out more import options here.

Use SQL to import data into the default.myscale_categorical_search table. Supported file formats include CSV (opens new window), CSVWithNames (opens new window), JSONEachRow (opens new window), and Parquet (opens new window). Please see Formats for Input and Output Data (opens new window) for detailed description of all supported formats.

  • Python
  • SQL
client.command("""
INSERT INTO default.myscale_categorical_search
    SELECT * FROM s3(
        'https://d3lhz231q7ogjd.cloudfront.net/sample-datasets/quick-start/categorical-search.csv',
        'CSVWithNames',
        'id UInt32, data Array(Float32), date Date, label Enum8(''person'' = 1, ''building'' = 2, ''animal'' = 3)'
    )""")

# Building a Vector Index

In addition to creating traditional indexes on structured data, you can also create a Vector Index in MyScale for vector embeddings.

  1. Create an MSTG vector index and check vector index build status:
  • Python
  • SQL
client.command("""
ALTER TABLE default.myscale_categorical_search
    ADD VECTOR INDEX categorical_vector_idx data
    TYPE MSTG
""")

The build time of the index will depend on the size of your data import. Further status of vector indexes can be found in Vector Reference Section

  1. Use SQL to check vector index build status:
  • Python
  • SQL
# Query the 'vector_indices' system table to check the status of the index creation.
get_index_status="SELECT status FROM system.vector_indices WHERE table='myscale_categorical_search'"
# Print the status of the index creation.  The status will be 'Built' if the index was created successfully.
print(f"index build status is {client.command(get_index_status)}")

Output:

  • Python
  • SQL
index build status is Built

# Executing SQL Queries

After importing data into a MyScale table, queries may be performed on the data within the table. In no time, you’ll be able to experience the quick query speeds attainable with MyScale.

Conventionally, queries are made using text or images, such as “a blue car” or images of a blue car. However, MyScale treats all queries as vectors, and gives a response to the query based on the similarity (“distance“) between the query and the existing data within the table.

To retrieve data using a vector as the query, use this SQL command as below, which returns the 10 most similar results:

  • Python
  • SQL
 # pick a random row from the table as the target
random_row = client.query("SELECT * FROM default.myscale_categorical_search ORDER BY rand() LIMIT 1")
assert random_row.row_count == 1
target_row_id = random_row.first_item["id"]
target_row_label = random_row.first_item["label"]
target_row_date = random_row.first_item["date"]
target_row_data = random_row.first_item["data"]
print("currently selected item id={}, label={}, date={}".format(target_row_id, target_row_label, target_row_date))
# Fetch the result of the query.
result = client.query(f"""
SELECT id, date, label, 
    distance(data, {target_row_data}) as dist FROM default.myscale_categorical_search ORDER BY dist LIMIT 10
""")
# Iterate through the rows of the query result and print the 'id', 'date',
# 'label', and distance for each row.
print("Top 10 candidates:")
for row in result.named_results():
    print(row["id"], row["date"], row["label"], row["dist"])

Sample Output:

id date label dist
0 2030-09-26 person 0
2 1975-10-07 animal 60,088
395,686 1975-05-04 animal 70,682
203,483 1982-11-28 building 72,585
597,767 2020-09-10 building 72,743
794,777 2015-04-03 person 74,797
591,738 2008-07-15 person 75,256
209,719 1978-06-13 building 76,462
608,767 1970-12-19 building 79,107
591,816 1995-03-20 building 79,390

The results are in the form of vector embeddings, which also can later be used to retrieve the original data by referencing the “id“ of the results.

In the earlier demonstration, a query was executed using vector embeddings. Additionally, users can also perform queries that utilize a combination of structured and vector data in SQL.

  • Python
  • SQL
 # Fetch the result of the query.
result = client.query(f"""
SELECT id, date, label, 
    distance(data, {target_row_data}) as dist
    FROM default.myscale_categorical_search WHERE toYear(date) >= 2000 AND label = 'animal'
    ORDER BY dist LIMIT 10
""")
# Iterate through the rows of the query result and print the 'id', 'date',
# 'label', and distance for each row.
for row in result.named_results():
    print(row["id"], row["date"], row["label"], row["dist"])

Sample Output:

id date label dist
601,326 2001-05-09 animal 83,481
406,181 2004-12-18 animal 93,655
13,369 2003-01-31 animal 95,158
209,834 2031-01-24 animal 97,258
10,216 2011-08-02 animal 103,297
605,180 2009-04-20 animal 103,839
21,768 2021-01-27 animal 105,764
1,988 2000-03-02 animal 107,305
598,464 2003-01-06 animal 109,670
200,525 2024-11-06 animal 110,029