Improvements to Advanced Tutorial doc #3266

Open
wants to merge 14 commits into
base: main
from
182 changes: 99 additions & 83 deletions docs/en/tutorial.md
Expand Up @@ -2,29 +2,25 @@
slug: /en/tutorial
sidebar_label: Advanced Tutorial
sidebar_position: 0.5
keywords: [clickhouse, install, tutorial, dictionary, dictionaries]
keywords: [clickhouse, install, tutorial, dictionary, dictionaries, example, advanced, taxi, new york, nyc]
---
import SQLConsoleDetail from '@site/docs/en/_snippets/_launch_sql_console.md';

# Advanced Tutorial

## What to Expect from This Tutorial?

In this tutorial, you will create a table and insert a large dataset (two million rows of the [New York taxi data](/docs/en/getting-started/example-datasets/nyc-taxi.md)). Then you will run queries on the dataset, including an example of how to create a dictionary and use it to perform a JOIN.
## Overview

:::note
This tutorial assumes you have access to a running ClickHouse service. If not, check out the [Quick Start](./quick-start.mdx).
:::
Learn how to ingest and query data in ClickHouse using a New York City taxi example dataset.

## 1. Create a New Table
### Prerequisites
You need access to a running ClickHouse service to complete this tutorial. For instructions, see the [Quick Start](./quick-start.mdx) guide.

The New York City taxi data contains the details of millions of taxi rides, with columns like pickup and drop-off times and locations, cost, tip amount, tolls, payment type and so on. Let's create a table to store this data...
## Create a new table

1. Connect to the SQL console
The New York City taxi dataset contains details about millions of taxi rides, with columns including tip amount, tolls, payment type, and more. Create a table to store this data.

<SQLConsoleDetail />
Author

These two giant screenshots seem extremely unnecessary and disruptive to the reader's experience.

1. Connect to the SQL console:
- For ClickHouse Cloud, select **SQL Console** from the left navigation menu.
- For self-managed ClickHouse, connect to the SQL console at `https://_hostname_:8443/play`. Check with your ClickHouse administrator for the details.

If you are using self-managed ClickHouse you can connect to the SQL console at https://_hostname_:8443/play (check with your ClickHouse administrator for the details).

2. Create the following `trips` table in the `default` database:
```sql
Expand Down Expand Up @@ -81,9 +77,9 @@ The New York City taxi data contains the details of millions of taxi rides, with
ORDER BY pickup_datetime;
```

## 2. Insert the Dataset
## Add the dataset

Now that you have a table created, let's add the NYC taxi data. It is in CSV files in S3, and you can load the data from there.
Now that you've created a table, add the New York City taxi data from tab-separated (TSV) files in S3.

1. The following command inserts ~2,000,000 rows into your `trips` table from two different files in S3: `trips_1.tsv.gz` and `trips_2.tsv.gz`:
```sql
Expand Down Expand Up @@ -139,56 +135,52 @@ Now that you have a table created, let's add the NYC taxi data. It is in CSV fil
") SETTINGS input_format_try_infer_datetimes = 0
```

2. Wait for the `INSERT` to finish - it might take a moment for the 150 MB of data to be downloaded.
2. Wait for the `INSERT` to finish. It might take a moment for the 150 MB of data to be downloaded.

:::note
The `s3` function cleverly knows how to decompress the data, and the `TabSeparatedWithNames` format tells ClickHouse that the data is tab-separated and also to skip the header row of each file.
:::

3. When the insert is finished, verify it worked:
```sql
SELECT count() FROM trips
```

You should see about 2M rows (1,999,657 rows, to be precise).

:::note
Notice how quickly and how few rows ClickHouse had to process to determine the count? You can get back the count in 0.001 seconds with only 6 rows processed.
:::

4. If you run a query that needs to hit every row, you will notice considerably more rows need to be processed, but the run time is still blazing fast:
```sql
SELECT DISTINCT(pickup_ntaname) FROM trips
```

This query has to process 2M rows and return 190 values, but notice it does this in about 1 second. The `pickup_ntaname` column represents the name of the neighborhood in New York City where the taxi ride originated.
This query should return a count of 1,999,657 rows.
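As an additional sanity check on the load, a query along these lines reports the row count together with the earliest and latest pickup times. This is a sketch rather than part of the tutorial; it assumes only the `trips` table and the `pickup_datetime` column created in the steps above.

```sql
-- Sketch: confirm the load and see the time range the data covers.
SELECT
    count() AS row_count,
    min(pickup_datetime) AS first_pickup,
    max(pickup_datetime) AS last_pickup
FROM trips
```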

## 3. Analyze the Data
## Analyze the data

Let's run some queries to analyze the 2M rows of data...
Run some queries to analyze the data. Explore the following examples or try your own SQL query.

1. We will start with some simple calculations, like computing the average tip amount:
- Calculate the average tip amount:
```sql
SELECT round(avg(tip_amount), 2) FROM trips
```

The response is:
<details>
<summary>Expected output</summary>
<p>

```response
┌─round(avg(tip_amount), 2)─┐
│ 1.68 │
└───────────────────────────┘
```

2. This query computes the average cost based on the number of passengers:
</p>
</details>

- Calculate the average cost based on the number of passengers:
```sql
SELECT
passenger_count,
ceil(avg(total_amount),2) AS average_total_amount
FROM trips
GROUP BY passenger_count
```

<details>
<summary>Expected output</summary>
<p>

The `passenger_count` ranges from 0 to 9:

```response
┌─passenger_count─┬─average_total_amount─┐
│ 0 │ 22.69 │
Expand All @@ -204,7 +196,10 @@ Let's run some queries to analyze the 2M rows of data...
└─────────────────┴──────────────────────┘
```

3. Here is a query that calculates the daily number of pickups per neighborhood:
</p>
</details>

- Calculate the daily number of pickups per neighborhood:
```sql
SELECT
pickup_date,
Expand All @@ -215,7 +210,10 @@ Let's run some queries to analyze the 2M rows of data...
ORDER BY pickup_date ASC
```

The result looks like:
<details>
<summary>Expected output</summary>
<p>

```response
┌─pickup_date─┬─pickup_ntaname───────────────────────────────────────────┬─number_of_trips─┐
│ 2015-07-01 │ Brooklyn Heights-Cobble Hill │ 13 │
Expand All @@ -229,8 +227,10 @@ Let's run some queries to analyze the 2M rows of data...
│ 2015-07-01 │ Bushwick South │ 5 │
```

</p>
</details>

4. This query computes the length of the trip and groups the results by that value:
- Calculate the length of each trip in minutes, then group the results by trip length:
```sql
SELECT
avg(tip_amount) AS avg_tip,
Expand All @@ -243,8 +243,10 @@ Let's run some queries to analyze the 2M rows of data...
GROUP BY trip_minutes
ORDER BY trip_minutes DESC
```

The result looks like:
<details>
<summary>Expected output</summary>
<p>

```response
┌──────────────avg_tip─┬───────────avg_fare─┬──────avg_passenger─┬──count─┬─trip_minutes─┐
│ 1.9600000381469727 │ 8 │ 1 │ 1 │ 27511 │
Expand All @@ -255,9 +257,11 @@ Let's run some queries to analyze the 2M rows of data...
│ 0.9682692398245518 │ 14.134615384615385 │ 2.076923076923077 │ 104 │ 1436 │
│ 1.1022105210705808 │ 13.778947368421052 │ 2.042105263157895 │ 95 │ 1435 │
```
</p>
</details>


5. This query shows the number of pickups in each neighborhood, broken down by hour of the day:
- Show the number of pickups in each neighborhood broken down by hour of the day:
```sql
SELECT
pickup_ntaname,
Expand All @@ -268,8 +272,10 @@ Let's run some queries to analyze the 2M rows of data...
GROUP BY pickup_ntaname, pickup_hour
ORDER BY pickup_ntaname, pickup_hour
```
<details>
<summary>Expected output</summary>
<p>

The result looks like:
```response
┌─pickup_ntaname───────────────────────────────────────────┬─pickup_hour─┬─pickups─┐
│ Airport │ 0 │ 3509 │
Expand Down Expand Up @@ -308,7 +314,11 @@ Let's run some queries to analyze the 2M rows of data...
│ Arden Heights │ 11 │ 1 │
```

7. Let's look at rides to LaGuardia or JFK airports:
</p>
</details>


- Retrieve rides to LaGuardia or JFK airports:
```sql
SELECT
pickup_datetime,
Expand All @@ -328,7 +338,10 @@ Let's run some queries to analyze the 2M rows of data...
ORDER BY pickup_datetime
```

The response is:
<details>
<summary>Expected output</summary>
<p>

```response
┌─────pickup_datetime─┬────dropoff_datetime─┬─total_amount─┬─pickup_nyct2010_gid─┬─dropoff_nyct2010_gid─┬─airport_code─┬─year─┬─day─┬─hour─┐
│ 2015-07-01 00:04:14 │ 2015-07-01 00:15:29 │ 13.3 │ -34 │ 132 │ JFK │ 2015 │ 1 │ 0 │
Expand All @@ -341,24 +354,31 @@ Let's run some queries to analyze the 2M rows of data...
│ 2015-07-01 00:41:48 │ 2015-07-01 00:44:45 │ 6.3 │ -94 │ 132 │ JFK │ 2015 │ 1 │ 0 │
│ 2015-07-01 01:06:18 │ 2015-07-01 01:14:43 │ 11.76 │ 37 │ 132 │ JFK │ 2015 │ 1 │ 1 │
```
## 4. Create a Dictionary

If you are new to ClickHouse, it is important to understand how ***dictionaries*** work. A simple way of thinking about a dictionary is a mapping of key->value pairs that is stored in memory. The details and all the options for dictionaries are linked at the end of the tutorial.
</p>
</details>
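As a starting point for trying your own query, the following sketch combines the aggregate functions used above with `toHour` to compare tips across the hours of the day. It is an illustration only, not part of the tutorial; the aliases `number_of_trips`, `avg_tip`, and `median_tip` are arbitrary names chosen here.

```sql
-- Sketch: trip volume and typical tip for each pickup hour.
SELECT
    toHour(pickup_datetime) AS pickup_hour,
    count() AS number_of_trips,
    round(avg(tip_amount), 2) AS avg_tip,
    round(median(tip_amount), 2) AS median_tip
FROM trips
GROUP BY pickup_hour
ORDER BY pickup_hour
```

Comparing the average with the median gives a rough sense of how much a few very large tips skew the result.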

## Create a dictionary

A dictionary is a mapping of key-value pairs stored in memory. For details, see [Dictionaries](/docs/en/sql-reference/dictionaries/index.md).

1. Let's see how to create a dictionary associated with a table in your ClickHouse service. The table and therefore the dictionary, will be based on a CSV file that contains 265 rows, one row for each neighborhood in NYC. The neighborhoods are mapped to the names of the NYC boroughs (NYC has 5 boroughs: the Bronx, Brooklyn, Manhattan, Queens and Staten Island), and this file counts Newark Airport (EWR) as a borough as well.
Create a dictionary associated with a table in your ClickHouse service.
The table and dictionary are based on a CSV file that contains a row for each neighborhood in New York City.

This is part of the CSV file (shown as a table for clarity). The `LocationID` column in the file maps to the `pickup_nyct2010_gid` and `dropoff_nyct2010_gid` columns in your `trips` table:
The neighborhoods are mapped to the names of the five New York City boroughs (Bronx, Brooklyn, Manhattan, Queens and Staten Island), as well as Newark Airport (EWR).

| LocationID | Borough | Zone | service_zone |
| ----------- | ----------- | ----------- | ----------- |
| 1 | EWR | Newark Airport | EWR |
| 2 | Queens | Jamaica Bay | Boro Zone |
| 3 | Bronx | Allerton/Pelham Gardens | Boro Zone |
| 4 | Manhattan | Alphabet City | Yellow Zone |
| 5 | Staten Island | Arden Heights | Boro Zone |
Here's an excerpt from the CSV file, shown in table format. The `LocationID` column in the file maps to the `pickup_nyct2010_gid` and `dropoff_nyct2010_gid` columns in your `trips` table:

| LocationID | Borough | Zone | service_zone |
| ----------- | ----------- | ----------- | ----------- |
| 1 | EWR | Newark Airport | EWR |
| 2 | Queens | Jamaica Bay | Boro Zone |
| 3 | Bronx | Allerton/Pelham Gardens | Boro Zone |
| 4 | Manhattan | Alphabet City | Yellow Zone |
| 5 | Staten Island | Arden Heights | Boro Zone |
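To inspect the file yourself before building the dictionary, a query like the following may help. It is a sketch that assumes your ClickHouse version can infer the column structure from the CSV header and that your service can reach the public bucket; the URL is the one the tutorial uses for the dictionary source.

```sql
-- Sketch: preview the first rows of the lookup file directly from S3.
-- Assumes schema inference is available for the CSVWithNames format.
SELECT *
FROM url(
    'https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/taxi_zone_lookup.csv',
    'CSVWithNames'
)
LIMIT 5
```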

2. The URL for the file is `https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/taxi_zone_lookup.csv`. Run the following SQL, which creates a Dictionary named `taxi_zone_dictionary` and populates the dictionary from the CSV file in S3:

1. Run the following SQL command, which creates a dictionary named `taxi_zone_dictionary` and populates the dictionary from the CSV file in S3. The URL for the file is `https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/taxi_zone_lookup.csv`.
```sql
CREATE DICTIONARY taxi_zone_dictionary
(
Expand All @@ -374,29 +394,22 @@ If you are new to ClickHouse, it is important to understand how ***dictionaries*
```

:::note
Setting `LIFETIME` to 0 means this dictionary will never update with its source. It is used here to not send unnecessary traffic to our S3 bucket, but in general you could specify any lifetime values you prefer.

For example:

```sql
LIFETIME(MIN 1 MAX 10)
```
specifies the dictionary to update after some random time between 1 and 10 seconds. (The random time is necessary in order to distribute the load on the dictionary source when updating on a large number of servers.)
Setting `LIFETIME` to 0 disables automatic updates, which avoids unnecessary traffic to our S3 bucket. In other cases, you might configure it differently. For details, see [Refreshing dictionary data using LIFETIME](/docs/en/sql-reference/dictionaries#refreshing-dictionary-data-using-lifetime).
:::
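To confirm how the dictionary was loaded and which lifetime it ended up with, you can query the `system.dictionaries` table. The following is a sketch that assumes the dictionary was created in your current service under the name `taxi_zone_dictionary`, as in the step above.

```sql
-- Sketch: check load status, size, and configured lifetime of the dictionary.
SELECT
    name,
    status,
    element_count,
    lifetime_min,
    lifetime_max,
    last_successful_update_time
FROM system.dictionaries
WHERE name = 'taxi_zone_dictionary'
```

A dictionary that has not been queried yet may not be loaded, so its `status` and `element_count` can change after the first lookup.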

3. Verify it worked - you should get 265 rows (one row for each neighborhood):
3. Verify it worked. The following should return 265 rows, or one row for each neighborhood:
```sql
SELECT * FROM taxi_zone_dictionary
```

4. Use the `dictGet` function ([or its variations](./sql-reference/functions/ext-dict-functions.md)) to retrieve a value from a dictionary. You pass in the name of the dictionary, the value you want, and the key (which in our example is the `LocationID` column of `taxi_zone_dictionary`).

For example, the following query returns the `Borough` whose `LocationID` is 132 (which as we saw above is JFK airport):
For example, the following query returns the `Borough` whose `LocationID` is 132, which corresponds to JFK airport:
```sql
SELECT dictGet('taxi_zone_dictionary', 'Borough', 132)
```

JFK is in Queens, and notice the time to retrieve the value is essentially 0:
JFK is in Queens. Notice the time to retrieve the value is essentially 0:
```response
┌─dictGet('taxi_zone_dictionary', 'Borough', 132)─┐
│ Queens │
Expand All @@ -405,7 +418,7 @@ If you are new to ClickHouse, it is important to understand how ***dictionaries*
1 rows in set. Elapsed: 0.004 sec.
```

5. Use the `dictHas` function to see if a key is present in the dictionary. For example, the following query returns 1 (which is "true" in ClickHouse):
5. Use the `dictHas` function to see if a key is present in the dictionary. For example, the following query returns `1` (which is "true" in ClickHouse):
```sql
SELECT dictHas('taxi_zone_dictionary', 132)
```
Expand Down Expand Up @@ -442,11 +455,11 @@ If you are new to ClickHouse, it is important to understand how ***dictionaries*
```


## 5. Perform a Join
## Perform a join

Let's write some queries that join the `taxi_zone_dictionary` with your `trips` table.
Write some queries that join the `taxi_zone_dictionary` with your `trips` table.

1. We can start with a simple JOIN that acts similarly to the previous airport query above:
1. Start with a simple `JOIN` that acts similarly to the previous airport query above:
```sql
SELECT
count(1) AS total,
Expand All @@ -458,7 +471,7 @@ Let's write some queries that join the `taxi_zone_dictionary` with your `trips`
ORDER BY total DESC
```

The response looks familiar:
The response closely matches the output of the earlier `dictGet` query:
```response
┌─total─┬─Borough───────┐
│ 7053 │ Manhattan │
Expand All @@ -476,7 +489,7 @@ Let's write some queries that join the `taxi_zone_dictionary` with your `trips`
Notice the output of the above `JOIN` query is the same as the query before it that used `dictGetOrDefault` (except that the `Unknown` values are not included). Behind the scenes, ClickHouse is actually calling the `dictGet` function for the `taxi_zone_dictionary` dictionary, but the `JOIN` syntax is more familiar for SQL developers.
:::

2. We do not use `SELECT *` often in ClickHouse - you should only retrieve the columns you actually need! But it is difficult to find a query that takes a long time, so this query purposely selects every column and returns every row (except there is a built-in 10,000 row maximum in the response by default), and also does a right join of every row with the dictionary:
2. This query returns rows for the 1000 trips with the highest tip amount, then performs an inner join of each row with the dictionary:
```sql
SELECT *
FROM trips
Expand All @@ -486,14 +499,17 @@ Let's write some queries that join the `taxi_zone_dictionary` with your `trips`
ORDER BY tip_amount DESC
LIMIT 1000
```
:::note
Generally, we avoid using `SELECT *` in ClickHouse. You should only retrieve the columns you actually need. However, in this example, we wanted it to be slow because why?
:::
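For comparison, here is a sketch of a narrower query that retrieves only a few columns and resolves the borough through the dictionary instead of selecting every column. It is an illustration rather than the PR's proposed text: the `toUInt64` cast on the key and the alias `pickup_borough` are assumptions made here, and `'Unknown'` is the fallback returned for IDs missing from the dictionary.

```sql
-- Sketch: top tips with only the columns needed, plus a dictionary lookup.
SELECT
    pickup_datetime,
    tip_amount,
    total_amount,
    dictGetOrDefault(
        'taxi_zone_dictionary',
        'Borough',
        toUInt64(pickup_nyct2010_gid),  -- assumed cast to the dictionary key type
        'Unknown'
    ) AS pickup_borough
FROM trips
ORDER BY tip_amount DESC
LIMIT 10
```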
Author

@noramullen1 noramullen1 Feb 14, 2025

It's not clear to me why you'd want this query to be slow.



#### Congrats!
## Next steps

Well done - you made it through the tutorial, and hopefully you have a better understanding of how to use ClickHouse. Here are some options for what to do next:
Learn more about ClickHouse with the following documentation:

- Read [how primary keys work in ClickHouse](./guides/best-practices/sparse-primary-indexes.md) - this knowledge will move you a long ways forward along your journey to becoming a ClickHouse expert
- [Integrate an external data source](/docs/en/integrations/index.mdx) like files, Kafka, PostgreSQL, data pipelines, or lots of other data sources
- [Connect your favorite UI/BI tool](./integrations/data-visualization/index.md) to ClickHouse
- Check out the [SQL Reference](./sql-reference/index.md) and browse through the various functions. ClickHouse has an amazing collection of functions for transforming, processing and analyzing data
- Learn more about [Dictionaries](/docs/en/sql-reference/dictionaries/index.md)
- [Introduction to Primary Indexes in ClickHouse](./guides/best-practices/sparse-primary-indexes.md): Learn how ClickHouse uses sparse primary indexes to efficiently locate relevant data during queries.
- [Integrate an external data source](/docs/en/integrations/index.mdx): Review data source integration options, including files, Kafka, PostgreSQL, data pipelines, and many others.
- [Visualize data in ClickHouse](./integrations/data-visualization/index.md): Connect your favorite UI/BI tool to ClickHouse.
- [SQL Reference](./sql-reference/index.md): Browse the SQL functions available in ClickHouse for transforming, processing and analyzing data.
