ClickHouse · noramullen1 · Feb 12, 2025 · Feb 13, 2025 · Feb 13, 2025 · Feb 13, 2025
@@ -2,29 +2,25 @@
 slug: /en/tutorial
 sidebar_label: Advanced Tutorial
 sidebar_position: 0.5
-keywords: [clickhouse, install, tutorial, dictionary, dictionaries]
+keywords: [clickhouse, install, tutorial, dictionary, dictionaries, example, advanced, taxi, new york, nyc]
 ---
-import SQLConsoleDetail from '@site/docs/en/_snippets/_launch_sql_console.md';
-
 # Advanced Tutorial
 
-## What to Expect from This Tutorial?
-
-In this tutorial, you will create a table and insert a large dataset (two million rows of the [New York taxi data](/docs/en/getting-started/example-datasets/nyc-taxi.md)). Then you will run queries on the dataset, including an example of how to create a dictionary and use it to perform a JOIN.
+## Overview
 
-:::note
-This tutorial assumes you have access to a running ClickHouse service.  If not, check out the [Quick Start](./quick-start.mdx).
-:::
+Learn how to ingest and query data in ClickHouse using a New York City taxi example dataset. 
 
-## 1. Create a New Table
+### Prerequisites
+You need access to a running ClickHouse service to complete this tutorial. For instructions, see the [Quick Start](./quick-start.mdx) guide.
 
-The New York City taxi data contains the details of millions of taxi rides, with columns like pickup and drop-off times and locations, cost, tip amount, tolls, payment type and so on. Let's create a table to store this data...
+## Create a new table
 
-1. Connect to the SQL console
+The New York City taxi dataset contains details about millions of taxi rides, with columns including tip amount, tolls, payment type, and more. Create a table to store this data.
 
-  <SQLConsoleDetail />
+1. Connect to the SQL console:
+- For ClickHouse Cloud, select **SQL Console** from the left navigation menu.
+- For self-managed ClickHouse, connect to the SQL console at `https://_hostname_:8443/play`. Check with your ClickHouse administrator for the details.
 
-  If you are using self-managed ClickHouse you can connect to the SQL console at https://_hostname_:8443/play (check with your ClickHouse administrator for the details).
 
 2. Create the following `trips` table in the `default` database:
     ```sql
@@ -81,9 +77,9 @@ The New York City taxi data contains the details of millions of taxi rides, with
     ORDER BY pickup_datetime;
     ```
 
-## 2. Insert the Dataset
+## Add the dataset
 
-Now that you have a table created, let's add the NYC taxi data. It is in CSV files in S3, and you can load the data from there.
+Now that you've created a table, add the New York City taxi data from compressed TSV files in S3.
 
 1. The following command inserts ~2,000,000 rows into your `trips` table from two different files in S3: `trips_1.tsv.gz` and `trips_2.tsv.gz`:
     ```sql
@@ -139,56 +135,52 @@ Now that you have a table created, let's add the NYC taxi data. It is in CSV fil
     ") SETTINGS input_format_try_infer_datetimes = 0
     ```
 
-2. Wait for the `INSERT` to finish - it might take a moment for the 150 MB of data to be downloaded.
+2. Wait for the `INSERT` to finish. It might take a moment for the 150 MB of data to be downloaded.
 
-    :::note
-    The `s3` function cleverly knows how to decompress the data, and the `TabSeparatedWithNames` format tells ClickHouse that the data is tab-separated and also to skip the header row of each file.
-    :::
 
 3. When the insert is finished, verify it worked:
     ```sql
     SELECT count() FROM trips
     ```
 
-    You should see about 2M rows (1,999,657 rows, to be precise).
-
-    :::note
-    Notice how quickly and how few rows ClickHouse had to process to determine the count? You can get back the count in 0.001 seconds with only 6 rows processed.
-    :::
-
-4. If you run a query that needs to hit every row, you will notice considerably more rows need to be processed, but the run time is still blazing fast:
-    ```sql
-    SELECT DISTINCT(pickup_ntaname) FROM trips
-    ```
-
-    This query has to process 2M rows and return 190 values, but notice it does this in about 1 second. The `pickup_ntaname` column represents the name of the neighborhood in New York City where the taxi ride originated.
+    This query should return a count of 1,999,657.
 
-## 3. Analyze the Data
+## Analyze the data
 
-Let's run some queries to analyze the 2M rows of data...
+Run some queries to analyze the data. Explore the following examples or try your own SQL query. 
 
-1. We will start with some simple calculations, like computing the average tip amount:
+- Calculate the average tip amount:
     ```sql
     SELECT round(avg(tip_amount), 2) FROM trips
     ```
-
-    The response is:
+    <details>
+    <summary>Expected output</summary>
+    <p>
+
     ```response
     ┌─round(avg(tip_amount), 2)─┐
     │                      1.68 │
     └───────────────────────────┘
     ```
 
-2. This query computes the average cost based on the number of passengers:
+    </p>
+    </details>
+
+- Calculate the average cost based on the number of passengers:
     ```sql
     SELECT
         passenger_count,
         ceil(avg(total_amount),2) AS average_total_amount
     FROM trips
     GROUP BY passenger_count
     ```
+
+    <details>
+    <summary>Expected output</summary>
+    <p>
 
     The `passenger_count` ranges from 0 to 9:
+
     ```response
     ┌─passenger_count─┬─average_total_amount─┐
     │               0 │                22.69 │
@@ -204,7 +196,10 @@ Let's run some queries to analyze the 2M rows of data...
     └─────────────────┴──────────────────────┘
     ```
 
-3. Here is a query that calculates the daily number of pickups per neighborhood:
+    </p>
+    </details>
+
+- Calculate the daily number of pickups per neighborhood:
     ```sql
     SELECT
         pickup_date,
@@ -215,7 +210,10 @@ Let's run some queries to analyze the 2M rows of data...
     ORDER BY pickup_date ASC
     ```
 
-    The result looks like:
+    <details>
+    <summary>Expected output</summary>
+    <p>
+
     ```response
     ┌─pickup_date─┬─pickup_ntaname───────────────────────────────────────────┬─number_of_trips─┐
     │  2015-07-01 │ Brooklyn Heights-Cobble Hill                             │              13 │
@@ -229,8 +227,10 @@ Let's run some queries to analyze the 2M rows of data...
     │  2015-07-01 │ Bushwick South                                           │               5 │
     ```
 
+    </p>
+    </details>
 
-4. This query computes the length of the trip and groups the results by that value:
+- Calculate the length of each trip in minutes, then group the results by trip length:
     ```sql
     SELECT
         avg(tip_amount) AS avg_tip,
@@ -243,8 +243,10 @@ Let's run some queries to analyze the 2M rows of data...
     GROUP BY trip_minutes
     ORDER BY trip_minutes DESC
     ```
-
-    The result looks like:
+    <details>
+    <summary>Expected output</summary>
+    <p>
+
     ```response
     ┌──────────────avg_tip─┬───────────avg_fare─┬──────avg_passenger─┬──count─┬─trip_minutes─┐
     │   1.9600000381469727 │                  8 │                  1 │      1 │        27511 │
@@ -255,9 +257,11 @@ Let's run some queries to analyze the 2M rows of data...
     │   0.9682692398245518 │ 14.134615384615385 │  2.076923076923077 │    104 │         1436 │
     │   1.1022105210705808 │ 13.778947368421052 │  2.042105263157895 │     95 │         1435 │
     ```
+    </p>
+    </details>
 
 
-5. This query shows the number of pickups in each neighborhood, broken down by hour of the day:
+- Show the number of pickups in each neighborhood broken down by hour of the day:
     ```sql
     SELECT
         pickup_ntaname,
@@ -268,8 +272,10 @@ Let's run some queries to analyze the 2M rows of data...
     GROUP BY pickup_ntaname, pickup_hour
     ORDER BY pickup_ntaname, pickup_hour
     ```
+    <details>
+    <summary>Expected output</summary>
+    <p>
 
-    The result looks like:
     ```response
     ┌─pickup_ntaname───────────────────────────────────────────┬─pickup_hour─┬─pickups─┐
     │ Airport                                                  │           0 │    3509 │
@@ -308,7 +314,11 @@ Let's run some queries to analyze the 2M rows of data...
     │ Arden Heights                                            │          11 │       1 │
     ```
 
-7. Let's look at rides to LaGuardia or JFK airports:
+    </p>
+    </details>
+
+
+- Retrieve rides to LaGuardia or JFK airports:
     ```sql
     SELECT
         pickup_datetime,
@@ -328,7 +338,10 @@ Let's run some queries to analyze the 2M rows of data...
     ORDER BY pickup_datetime
     ```
 
-    The response is:
+    <details>
+    <summary>Expected output</summary>
+    <p>
+
     ```response
     ┌─────pickup_datetime─┬────dropoff_datetime─┬─total_amount─┬─pickup_nyct2010_gid─┬─dropoff_nyct2010_gid─┬─airport_code─┬─year─┬─day─┬─hour─┐
     │ 2015-07-01 00:04:14 │ 2015-07-01 00:15:29 │         13.3 │                 -34 │                  132 │ JFK          │ 2015 │   1 │    0 │
@@ -341,24 +354,31 @@ Let's run some queries to analyze the 2M rows of data...
     │ 2015-07-01 00:41:48 │ 2015-07-01 00:44:45 │          6.3 │                 -94 │                  132 │ JFK          │ 2015 │   1 │    0 │
     │ 2015-07-01 01:06:18 │ 2015-07-01 01:14:43 │        11.76 │                  37 │                  132 │ JFK          │ 2015 │   1 │    1 │
     ```
-## 4. Create a Dictionary
 
-If you are new to ClickHouse, it is important to understand how ***dictionaries*** work. A simple way of thinking about a dictionary is a mapping of key->value pairs that is stored in memory. The details and all the options for dictionaries are linked at the end of the tutorial.
+    </p>
+    </details>
+
+## Create a dictionary
+
+A dictionary is a mapping of key-value pairs stored in memory. For details, see [Dictionaries](/docs/en/sql-reference/dictionaries/index.md).
 
-1. Let's see how to create a dictionary associated with a table in your ClickHouse service. The table and therefore the dictionary, will be based on a CSV file that contains 265 rows, one row for each neighborhood in NYC. The neighborhoods are mapped to the names of the NYC boroughs (NYC has 5 boroughs: the Bronx, Brooklyn, Manhattan, Queens and Staten Island), and this file counts Newark Airport (EWR) as a borough as well.
+Create a dictionary associated with a table in your ClickHouse service.
+The table and dictionary are based on a CSV file that contains a row for each neighborhood in New York City. 
 
-  This is part of the CSV file (shown as a table for clarity).  The `LocationID` column in the file maps to the `pickup_nyct2010_gid` and `dropoff_nyct2010_gid` columns in your `trips` table:
+The neighborhoods are mapped to the names of the five New York City boroughs (Bronx, Brooklyn, Manhattan, Queens and Staten Island), as well as Newark Airport (EWR).
 
-    | LocationID      | Borough |  Zone      | service_zone |
-    | ----------- | ----------- |   ----------- | ----------- |
-    | 1      | EWR       |  Newark Airport   | EWR        |
-    | 2    |   Queens     |   Jamaica Bay   |      Boro Zone   |
-    | 3   |   Bronx     |  Allerton/Pelham Gardens    |    Boro Zone     |
-    | 4     |    Manhattan    |    Alphabet City  |     Yellow Zone    |
-    | 5     |  Staten Island      |   Arden Heights   |    Boro Zone     |
+Here's an excerpt from the CSV file, shown as a table for clarity. The `LocationID` column in the file maps to the `pickup_nyct2010_gid` and `dropoff_nyct2010_gid` columns in your `trips` table:
 
+  | LocationID      | Borough |  Zone      | service_zone |
+  | ----------- | ----------- |   ----------- | ----------- |
+  | 1      | EWR       |  Newark Airport   | EWR        |
+  | 2    |   Queens     |   Jamaica Bay   |      Boro Zone   |
+  | 3   |   Bronx     |  Allerton/Pelham Gardens    |    Boro Zone     |
+  | 4     |    Manhattan    |    Alphabet City  |     Yellow Zone    |
+  | 5     |  Staten Island      |   Arden Heights   |    Boro Zone     |
 
-2. The URL for the file is `https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/taxi_zone_lookup.csv`. Run the following SQL, which creates a Dictionary named `taxi_zone_dictionary` and populates the dictionary from the CSV file in S3:
+
+1. Run the following SQL command, which creates a dictionary named `taxi_zone_dictionary` and populates the dictionary from the CSV file in S3. The URL for the file is `https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/taxi_zone_lookup.csv`. 
   ```sql
   CREATE DICTIONARY taxi_zone_dictionary
   (
@@ -374,29 +394,22 @@ If you are new to ClickHouse, it is important to understand how ***dictionaries*
   ```
 
   :::note
-  Setting `LIFETIME` to 0 means this dictionary will never update with its source. It is used here to not send unnecessary traffic to our S3 bucket, but in general you could specify any lifetime values you prefer.
-
-    For example:
-
-    ```sql
-    LIFETIME(MIN 1 MAX 10)
-    ```
-    specifies the dictionary to update after some random time between 1 and 10 seconds. (The random time is necessary in order to distribute the load on the dictionary source when updating on a large number of servers.)
+  Setting `LIFETIME` to 0 disables automatic updates to avoid unnecessary traffic to our S3 bucket. In other cases, you might configure it differently. For details, see [Refreshing dictionary data using LIFETIME](/docs/en/sql-reference/dictionaries#refreshing-dictionary-data-using-lifetime).
   :::
 
-3. Verify it worked - you should get 265 rows (one row for each neighborhood):
+3. Verify it worked. The following should return 265 rows, or one row for each neighborhood:
     ```sql
     SELECT * FROM taxi_zone_dictionary
     ```
 
 4. Use the `dictGet` function ([or its variations](./sql-reference/functions/ext-dict-functions.md)) to retrieve a value from a dictionary. You pass in the name of the dictionary, the value you want, and the key (which in our example is the `LocationID` column of `taxi_zone_dictionary`).
 
-    For example, the following query returns the `Borough` whose `LocationID` is 132 (which as we saw above is JFK airport):
+    For example, the following query returns the `Borough` whose `LocationID` is 132, which corresponds to JFK airport:
     ```sql
     SELECT dictGet('taxi_zone_dictionary', 'Borough', 132)
     ```
 
-    JFK is in Queens, and notice the time to retrieve the value is essentially 0:
+    JFK is in Queens. Notice the time to retrieve the value is essentially 0:
     ```response
     ┌─dictGet('taxi_zone_dictionary', 'Borough', 132)─┐
     │ Queens                                          │
@@ -405,7 +418,7 @@ If you are new to ClickHouse, it is important to understand how ***dictionaries*
     1 rows in set. Elapsed: 0.004 sec.
     ```
 
-5. Use the `dictHas` function to see if a key is present in the dictionary. For example, the following query returns 1 (which is "true" in ClickHouse):
+5. Use the `dictHas` function to see if a key is present in the dictionary. For example, the following query returns `1` (which is "true" in ClickHouse):
     ```sql
     SELECT dictHas('taxi_zone_dictionary', 132)
     ```
@@ -442,11 +455,11 @@ If you are new to ClickHouse, it is important to understand how ***dictionaries*
     ```
 
 
-## 5. Perform a Join
+## Perform a join
 
-Let's write some queries that join the `taxi_zone_dictionary` with your `trips` table.
+Write some queries that join the `taxi_zone_dictionary` with your `trips` table.
 
-1. We can start with a simple JOIN that acts similarly to the previous airport query above:
+1. Start with a simple `JOIN` that acts similarly to the previous airport query:
     ```sql
     SELECT
         count(1) AS total,
@@ -458,7 +471,7 @@ Let's write some queries that join the `taxi_zone_dictionary` with your `trips`
     ORDER BY total DESC
     ```
 
-    The response looks familiar:
+    The response is similar to the output of the previous `dictGetOrDefault` query:
     ```response
     ┌─total─┬─Borough───────┐
     │  7053 │ Manhattan     │
@@ -476,7 +489,7 @@ Let's write some queries that join the `taxi_zone_dictionary` with your `trips`
     Notice the output of the above `JOIN` query is the same as the query before it that used `dictGetOrDefault` (except that the `Unknown` values are not included). Behind the scenes, ClickHouse is actually calling the `dictGet` function for the `taxi_zone_dictionary` dictionary, but the `JOIN` syntax is more familiar for SQL developers.
     :::
 
-2. We do not use `SELECT *` often in ClickHouse - you should only retrieve the columns you actually need! But it is difficult to find a query that takes a long time, so this query purposely selects every column and returns every row (except there is a built-in 10,000 row maximum in the response by default), and also does a right join of every row with the dictionary:
+2. This query returns rows for the 1000 trips with the highest tip amount, then performs an inner join of each row with the dictionary:
     ```sql
     SELECT *
     FROM trips
@@ -486,14 +499,17 @@ Let's write some queries that join the `taxi_zone_dictionary` with your `trips`
     ORDER BY tip_amount DESC
     LIMIT 1000
     ```
+    :::note
+    Generally, avoid `SELECT *` in ClickHouse and retrieve only the columns you actually need. This example selects every column on purpose, because it is otherwise difficult to find a query that takes a long time to run.
+    :::
+
 
-#### Congrats!
+## Next steps
 
-Well done - you made it through the tutorial, and hopefully you have a better understanding of how to use ClickHouse. Here are some options for what to do next:
+Learn more about ClickHouse with the following documentation:
 
-- Read [how primary keys work in ClickHouse](./guides/best-practices/sparse-primary-indexes.md) - this knowledge will move you a long ways forward along your journey to becoming a ClickHouse expert
-- [Integrate an external data source](/docs/en/integrations/index.mdx) like files, Kafka, PostgreSQL, data pipelines, or lots of other data sources
-- [Connect your favorite UI/BI tool](./integrations/data-visualization/index.md) to ClickHouse
-- Check out the [SQL Reference](./sql-reference/index.md) and browse through the various functions. ClickHouse has an amazing collection of functions for transforming, processing and analyzing data
-- Learn more about [Dictionaries](/docs/en/sql-reference/dictionaries/index.md)
+- [Introduction to Primary Indexes in ClickHouse](./guides/best-practices/sparse-primary-indexes.md): Learn how ClickHouse uses sparse primary indexes to efficiently locate relevant data during queries. 
+- [Integrate an external data source](/docs/en/integrations/index.mdx): Review data source integration options, including files, Kafka, PostgreSQL, data pipelines, and many others.
+- [Visualize data in ClickHouse](./integrations/data-visualization/index.md): Connect your favorite UI/BI tool to ClickHouse.
+- [SQL Reference](./sql-reference/index.md): Browse the SQL functions available in ClickHouse for transforming, processing and analyzing data.
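
The advice in the final note, to retrieve only the columns you need, can be sketched as a narrower version of the last query. This is a minimal sketch and not part of the tutorial: it assumes the `trips` table and `taxi_zone_dictionary` created earlier, and the column choice, the `dropoff_borough` alias, and the `toUInt64` cast on the key are illustrative assumptions.

```sql
-- Hypothetical variant of the last query: select three columns instead of `*`,
-- and look up the borough with dictGetOrDefault instead of a JOIN.
SELECT
    pickup_datetime,
    tip_amount,
    -- Returns 'Unknown' when the drop-off ID has no entry in the dictionary.
    dictGetOrDefault(
        'taxi_zone_dictionary',
        'Borough',
        toUInt64(dropoff_nyct2010_gid),
        'Unknown'
    ) AS dropoff_borough
FROM trips
ORDER BY tip_amount DESC
LIMIT 1000
```

Unlike an inner join, this form keeps trips whose drop-off ID is missing from the dictionary and labels them `Unknown`, so the two results may differ.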