Improvements to Advanced Tutorial doc #3266
Open
noramullen1 wants to merge 14 commits into ClickHouse:main from noramullen1:tutorial-updates
Changes from 10 commits
Commits
All commits by noramullen1:
- e2f2490 Update tutorial.md
- abc4c0d Update tutorial.md
- adf5373 Merge branch 'ClickHouse:main' into tutorial-updates
- bc687a5 Update tutorial.md
- 1b6fe1e Merge branch 'tutorial-updates' of https://github.com/noramullen1/cli…
- 8030a28 Update tutorial.md
- 1186e63 Update tutorial.md
- 1f3678d Update tutorial.md
- a511375 Update tutorial.md
- 7e3a0ad Update tutorial.md
- 3e9f5b4 Update tutorial.md
- e3c7424 Update aspell-dict-file.txt
- 1b820ce Update tutorial.md
- 92be0c9 fix file path
|
@@ -2,29 +2,25 @@ | |
slug: /en/tutorial | ||
sidebar_label: Advanced Tutorial | ||
sidebar_position: 0.5 | ||
keywords: [clickhouse, install, tutorial, dictionary, dictionaries] | ||
keywords: [clickhouse, install, tutorial, dictionary, dictionaries, example, advanced, taxi, new york, nyc] | ||
--- | ||
import SQLConsoleDetail from '@site/docs/en/_snippets/_launch_sql_console.md'; | ||
|
||
# Advanced Tutorial | ||
|
||
## What to Expect from This Tutorial? | ||
|
||
In this tutorial, you will create a table and insert a large dataset (two million rows of the [New York taxi data](/docs/en/getting-started/example-datasets/nyc-taxi.md)). Then you will run queries on the dataset, including an example of how to create a dictionary and use it to perform a JOIN. | ||
## Overview | ||
|
||
:::note | ||
This tutorial assumes you have access to a running ClickHouse service. If not, check out the [Quick Start](./quick-start.mdx). | ||
::: | ||
Learn how to ingest and query data in ClickHouse using a New York City taxi example dataset. | ||
|
||
## 1. Create a New Table | ||
### Prerequisites | ||
You need access to a running ClickHouse service to complete this tutorial. For instructions, see the [Quick Start](./quick-start.mdx) guide. | ||
|
||
The New York City taxi data contains the details of millions of taxi rides, with columns like pickup and drop-off times and locations, cost, tip amount, tolls, payment type and so on. Let's create a table to store this data... | ||
## Create a new table | ||
|
||
1. Connect to the SQL console | ||
The New York City taxi dataset contains details about millions of taxi rides, with columns including pickup and drop-off times and locations, cost, tip amount, tolls, and payment type. Create a table to store this data. | ||
|
||
<SQLConsoleDetail /> | ||
1. Connect to the SQL console: | ||
- For ClickHouse Cloud, select **SQL Console** from the left navigation menu. | ||
- For self-managed ClickHouse, connect to the SQL console at `https://_hostname_:8443/play`. Check with your ClickHouse administrator for the details. | ||
|
||
If you are using self-managed ClickHouse you can connect to the SQL console at https://_hostname_:8443/play (check with your ClickHouse administrator for the details). | ||
|
||
2. Create the following `trips` table in the `default` database: | ||
```sql | ||
|
@@ -81,9 +77,9 @@ The New York City taxi data contains the details of millions of taxi rides, with | |
ORDER BY pickup_datetime; | ||
``` | ||
|
||
## 2. Insert the Dataset | ||
## Add the dataset | ||
|
||
Now that you have a table created, let's add the NYC taxi data. It is in CSV files in S3, and you can load the data from there. | ||
Now that you've created a table, add the New York City taxi data from CSV files in S3. | ||
|
||
1. The following command inserts ~2,000,000 rows into your `trips` table from two different files in S3: `trips_1.tsv.gz` and `trips_2.tsv.gz`: | ||
```sql | ||
|
@@ -139,56 +135,52 @@ Now that you have a table created, let's add the NYC taxi data. It is in CSV fil | |
") SETTINGS input_format_try_infer_datetimes = 0 | ||
``` | ||
|
||
2. Wait for the `INSERT` to finish - it might take a moment for the 150 MB of data to be downloaded. | ||
2. Wait for the `INSERT` to finish. It might take a moment for the 150 MB of data to be downloaded. | ||
|
||
:::note | ||
The `s3` function cleverly knows how to decompress the data, and the `TabSeparatedWithNames` format tells ClickHouse that the data is tab-separated and also to skip the header row of each file. | ||
::: | ||
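As an illustration of that behavior, the `s3` table function can also be queried directly; a minimal sketch (the file URL here is hypothetical, not taken from the tutorial):

```sql
-- Sketch only: read one gzip-compressed file straight from S3.
-- The TabSeparatedWithNames format tells ClickHouse the data is
-- tab-separated and that the first row is a header to skip.
SELECT count()
FROM s3(
    'https://example-bucket.s3.amazonaws.com/nyc-taxi/trips_1.tsv.gz',
    'TabSeparatedWithNames'
)
```

Decompression is inferred from the file extension, so no extra setting is needed for the `.gz` files.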
|
||
3. When the insert is finished, verify it worked: | ||
```sql | ||
SELECT count() FROM trips | ||
``` | ||
|
||
You should see about 2M rows (1,999,657 rows, to be precise). | ||
|
||
:::note | ||
Notice how quickly and how few rows ClickHouse had to process to determine the count? You can get back the count in 0.001 seconds with only 6 rows processed. | ||
::: | ||
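To see for yourself how little data a query will touch, you can ask ClickHouse for an estimate before running it (a sketch, not part of the original tutorial):

```sql
-- Estimate the parts, rows, and marks the query would read,
-- without actually executing the aggregation
EXPLAIN ESTIMATE SELECT count() FROM trips
```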
|
||
4. If you run a query that needs to hit every row, you will notice considerably more rows need to be processed, but the run time is still blazing fast: | ||
```sql | ||
SELECT DISTINCT(pickup_ntaname) FROM trips | ||
``` | ||
|
||
This query has to process 2M rows and return 190 values, but notice it does this in about 1 second. The `pickup_ntaname` column represents the name of the neighborhood in New York City where the taxi ride originated. | ||
This query processes all 1,999,657 rows but returns only about 190 distinct neighborhood names. | ||
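If you only want the number of distinct neighborhoods rather than the list itself, an aggregate such as `uniqExact` works too (a sketch, not part of the original tutorial):

```sql
-- Count distinct pickup neighborhoods without returning each name
SELECT uniqExact(pickup_ntaname) FROM trips
```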
|
||
## 3. Analyze the Data | ||
## Analyze the data | ||
|
||
Let's run some queries to analyze the 2M rows of data... | ||
Run some queries to analyze the data. Explore the following examples or try your own SQL query. | ||
|
||
1. We will start with some simple calculations, like computing the average tip amount: | ||
- Calculate the average tip amount: | ||
```sql | ||
SELECT round(avg(tip_amount), 2) FROM trips | ||
``` | ||
|
||
The response is: | ||
<details> | ||
<summary>Expected output</summary> | ||
<p> | ||
|
||
```response | ||
┌─round(avg(tip_amount), 2)─┐ | ||
│ 1.68 │ | ||
└───────────────────────────┘ | ||
``` | ||
|
||
2. This query computes the average cost based on the number of passengers: | ||
</p> | ||
</details> | ||
|
||
- Calculate the average cost based on the number of passengers: | ||
```sql | ||
SELECT | ||
passenger_count, | ||
ceil(avg(total_amount),2) AS average_total_amount | ||
FROM trips | ||
GROUP BY passenger_count | ||
``` | ||
|
||
<details> | ||
<summary>Expected output</summary> | ||
<p> | ||
|
||
The `passenger_count` ranges from 0 to 9: | ||
|
||
```response | ||
┌─passenger_count─┬─average_total_amount─┐ | ||
│ 0 │ 22.69 │ | ||
|
@@ -204,7 +196,10 @@ Let's run some queries to analyze the 2M rows of data... | |
└─────────────────┴──────────────────────┘ | ||
``` | ||
|
||
3. Here is a query that calculates the daily number of pickups per neighborhood: | ||
</p> | ||
</details> | ||
|
||
- Calculate the daily number of pickups per neighborhood: | ||
```sql | ||
SELECT | ||
pickup_date, | ||
|
@@ -215,7 +210,10 @@ Let's run some queries to analyze the 2M rows of data... | |
ORDER BY pickup_date ASC | ||
``` | ||
|
||
The result looks like: | ||
<details> | ||
<summary>Expected output</summary> | ||
<p> | ||
|
||
```response | ||
┌─pickup_date─┬─pickup_ntaname───────────────────────────────────────────┬─number_of_trips─┐ | ||
│ 2015-07-01 │ Brooklyn Heights-Cobble Hill │ 13 │ | ||
|
@@ -229,8 +227,10 @@ Let's run some queries to analyze the 2M rows of data... | |
│ 2015-07-01 │ Bushwick South │ 5 │ | ||
``` | ||
|
||
</p> | ||
</details> | ||
|
||
4. This query computes the length of the trip and groups the results by that value: | ||
- Calculate the length of each trip in minutes, then group the results by trip length: | ||
```sql | ||
SELECT | ||
avg(tip_amount) AS avg_tip, | ||
|
@@ -243,8 +243,10 @@ Let's run some queries to analyze the 2M rows of data... | |
GROUP BY trip_minutes | ||
ORDER BY trip_minutes DESC | ||
``` | ||
|
||
The result looks like: | ||
<details> | ||
<summary>Expected output</summary> | ||
<p> | ||
|
||
```response | ||
┌──────────────avg_tip─┬───────────avg_fare─┬──────avg_passenger─┬──count─┬─trip_minutes─┐ | ||
│ 1.9600000381469727 │ 8 │ 1 │ 1 │ 27511 │ | ||
|
@@ -255,9 +257,11 @@ Let's run some queries to analyze the 2M rows of data... | |
│ 0.9682692398245518 │ 14.134615384615385 │ 2.076923076923077 │ 104 │ 1436 │ | ||
│ 1.1022105210705808 │ 13.778947368421052 │ 2.042105263157895 │ 95 │ 1435 │ | ||
``` | ||
</p> | ||
</details> | ||
|
||
|
||
5. This query shows the number of pickups in each neighborhood, broken down by hour of the day: | ||
- Show the number of pickups in each neighborhood broken down by hour of the day: | ||
```sql | ||
SELECT | ||
pickup_ntaname, | ||
|
@@ -268,8 +272,10 @@ Let's run some queries to analyze the 2M rows of data... | |
GROUP BY pickup_ntaname, pickup_hour | ||
ORDER BY pickup_ntaname, pickup_hour | ||
``` | ||
<details> | ||
<summary>Expected output</summary> | ||
<p> | ||
|
||
The result looks like: | ||
```response | ||
┌─pickup_ntaname───────────────────────────────────────────┬─pickup_hour─┬─pickups─┐ | ||
│ Airport │ 0 │ 3509 │ | ||
|
@@ -308,7 +314,11 @@ Let's run some queries to analyze the 2M rows of data... | |
│ Arden Heights │ 11 │ 1 │ | ||
``` | ||
|
||
7. Let's look at rides to LaGuardia or JFK airports: | ||
</p> | ||
</details> | ||
|
||
|
||
- Retrieve rides to LaGuardia or JFK airports: | ||
```sql | ||
SELECT | ||
pickup_datetime, | ||
|
@@ -328,7 +338,10 @@ Let's run some queries to analyze the 2M rows of data... | |
ORDER BY pickup_datetime | ||
``` | ||
|
||
The response is: | ||
<details> | ||
<summary>Expected output</summary> | ||
<p> | ||
|
||
```response | ||
┌─────pickup_datetime─┬────dropoff_datetime─┬─total_amount─┬─pickup_nyct2010_gid─┬─dropoff_nyct2010_gid─┬─airport_code─┬─year─┬─day─┬─hour─┐ | ||
│ 2015-07-01 00:04:14 │ 2015-07-01 00:15:29 │ 13.3 │ -34 │ 132 │ JFK │ 2015 │ 1 │ 0 │ | ||
|
@@ -341,24 +354,31 @@ Let's run some queries to analyze the 2M rows of data... | |
│ 2015-07-01 00:41:48 │ 2015-07-01 00:44:45 │ 6.3 │ -94 │ 132 │ JFK │ 2015 │ 1 │ 0 │ | ||
│ 2015-07-01 01:06:18 │ 2015-07-01 01:14:43 │ 11.76 │ 37 │ 132 │ JFK │ 2015 │ 1 │ 1 │ | ||
``` | ||
## 4. Create a Dictionary | ||
|
||
If you are new to ClickHouse, it is important to understand how ***dictionaries*** work. A simple way of thinking about a dictionary is a mapping of key->value pairs that is stored in memory. The details and all the options for dictionaries are linked at the end of the tutorial. | ||
</p> | ||
</details> | ||
|
||
## Create a dictionary | ||
|
||
A dictionary is a mapping of key-value pairs stored in memory. For details, see [Dictionaries](/docs/en/sql-reference/dictionaries/index.md). | ||
|
||
1. Let's see how to create a dictionary associated with a table in your ClickHouse service. The table and therefore the dictionary, will be based on a CSV file that contains 265 rows, one row for each neighborhood in NYC. The neighborhoods are mapped to the names of the NYC boroughs (NYC has 5 boroughs: the Bronx, Brooklyn, Manhattan, Queens and Staten Island), and this file counts Newark Airport (EWR) as a borough as well. | ||
Create a dictionary associated with a table in your ClickHouse service. | ||
The table and dictionary are based on a CSV file that contains a row for each neighborhood in New York City. | ||
|
||
This is part of the CSV file (shown as a table for clarity). The `LocationID` column in the file maps to the `pickup_nyct2010_gid` and `dropoff_nyct2010_gid` columns in your `trips` table: | ||
The neighborhoods are mapped to the names of the five New York City boroughs (Bronx, Brooklyn, Manhattan, Queens, and Staten Island), as well as Newark Airport (EWR). | ||
|
||
| LocationID | Borough | Zone | service_zone | | ||
| ----------- | ----------- | ----------- | ----------- | | ||
| 1 | EWR | Newark Airport | EWR | | ||
| 2 | Queens | Jamaica Bay | Boro Zone | | ||
| 3 | Bronx | Allerton/Pelham Gardens | Boro Zone | | ||
| 4 | Manhattan | Alphabet City | Yellow Zone | | ||
| 5 | Staten Island | Arden Heights | Boro Zone | | ||
Here's an excerpt from the CSV file you're using in table format. The `LocationID` column in the file maps to the `pickup_nyct2010_gid` and `dropoff_nyct2010_gid` columns in your `trips` table: | ||
|
||
| LocationID | Borough | Zone | service_zone | | ||
| ----------- | ----------- | ----------- | ----------- | | ||
| 1 | EWR | Newark Airport | EWR | | ||
| 2 | Queens | Jamaica Bay | Boro Zone | | ||
| 3 | Bronx | Allerton/Pelham Gardens | Boro Zone | | ||
| 4 | Manhattan | Alphabet City | Yellow Zone | | ||
| 5 | Staten Island | Arden Heights | Boro Zone | | ||
|
||
2. The URL for the file is `https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/taxi_zone_lookup.csv`. Run the following SQL, which creates a Dictionary named `taxi_zone_dictionary` and populates the dictionary from the CSV file in S3: | ||
|
||
2. Run the following SQL command, which creates a dictionary named `taxi_zone_dictionary` and populates it from the CSV file in S3. The URL for the file is `https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/taxi_zone_lookup.csv`. | ||
```sql | ||
CREATE DICTIONARY taxi_zone_dictionary | ||
( | ||
|
@@ -374,29 +394,22 @@ If you are new to ClickHouse, it is important to understand how ***dictionaries* | |
``` | ||
|
||
:::note | ||
Setting `LIFETIME` to 0 means this dictionary will never update with its source. It is used here to not send unnecessary traffic to our S3 bucket, but in general you could specify any lifetime values you prefer. | ||
|
||
For example: | ||
|
||
```sql | ||
LIFETIME(MIN 1 MAX 10) | ||
``` | ||
specifies the dictionary to update after some random time between 1 and 10 seconds. (The random time is necessary in order to distribute the load on the dictionary source when updating on a large number of servers.) | ||
Setting `LIFETIME` to 0 disables automatic updates to avoid unnecessary traffic to our S3 bucket. In other cases, you might configure it differently. For details, see [Refreshing dictionary data using LIFETIME](/docs/en/sql-reference/dictionaries#refreshing-dictionary-data-using-lifetime). | ||
::: | ||
|
||
3. Verify it worked - you should get 265 rows (one row for each neighborhood): | ||
3. Verify it worked. The following should return 265 rows, or one row for each neighborhood: | ||
```sql | ||
SELECT * FROM taxi_zone_dictionary | ||
``` | ||
|
||
4. Use the `dictGet` function ([or its variations](./sql-reference/functions/ext-dict-functions.md)) to retrieve a value from a dictionary. You pass in the name of the dictionary, the value you want, and the key (which in our example is the `LocationID` column of `taxi_zone_dictionary`). | ||
|
||
For example, the following query returns the `Borough` whose `LocationID` is 132 (which as we saw above is JFK airport): | ||
For example, the following query returns the `Borough` whose `LocationID` is 132, which corresponds to JFK airport: | ||
```sql | ||
SELECT dictGet('taxi_zone_dictionary', 'Borough', 132) | ||
``` | ||
|
||
JFK is in Queens, and notice the time to retrieve the value is essentially 0: | ||
JFK is in Queens. Notice the time to retrieve the value is essentially 0: | ||
```response | ||
┌─dictGet('taxi_zone_dictionary', 'Borough', 132)─┐ | ||
│ Queens │ | ||
|
@@ -405,7 +418,7 @@ If you are new to ClickHouse, it is important to understand how ***dictionaries* | |
1 rows in set. Elapsed: 0.004 sec. | ||
``` | ||
|
||
5. Use the `dictHas` function to see if a key is present in the dictionary. For example, the following query returns 1 (which is "true" in ClickHouse): | ||
5. Use the `dictHas` function to see if a key is present in the dictionary. For example, the following query returns `1` (which is "true" in ClickHouse): | ||
```sql | ||
SELECT dictHas('taxi_zone_dictionary', 132) | ||
``` | ||
|
@@ -442,11 +455,11 @@ If you are new to ClickHouse, it is important to understand how ***dictionaries* | |
``` | ||
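The step elided above introduces `dictGetOrDefault`, which the later join note refers back to; a minimal sketch of its behavior (the key 9999 is assumed to be absent from the dictionary):

```sql
-- Returns the fallback value 'Unknown' when the key is not present
SELECT dictGetOrDefault('taxi_zone_dictionary', 'Borough', toUInt64(9999), 'Unknown')
```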
|
||
|
||
## 5. Perform a Join | ||
## Perform a join | ||
|
||
Let's write some queries that join the `taxi_zone_dictionary` with your `trips` table. | ||
Write some queries that join the `taxi_zone_dictionary` with your `trips` table. | ||
|
||
1. We can start with a simple JOIN that acts similarly to the previous airport query above: | ||
1. Start with a simple `JOIN` that acts similarly to the previous airport query above: | ||
```sql | ||
SELECT | ||
count(1) AS total, | ||
|
@@ -458,7 +471,7 @@ Let's write some queries that join the `taxi_zone_dictionary` with your `trips` | |
ORDER BY total DESC | ||
``` | ||
|
||
The response looks familiar: | ||
The response matches the earlier `dictGetOrDefault` query, minus the `Unknown` rows: | ||
```response | ||
┌─total─┬─Borough───────┐ | ||
│ 7053 │ Manhattan │ | ||
|
@@ -476,7 +489,7 @@ Let's write some queries that join the `taxi_zone_dictionary` with your `trips` | |
Notice the output of the above `JOIN` query is the same as the query before it that used `dictGetOrDefault` (except that the `Unknown` values are not included). Behind the scenes, ClickHouse is actually calling the `dictGet` function for the `taxi_zone_dictionary` dictionary, but the `JOIN` syntax is more familiar for SQL developers. | ||
::: | ||
|
||
2. We do not use `SELECT *` often in ClickHouse - you should only retrieve the columns you actually need! But it is difficult to find a query that takes a long time, so this query purposely selects every column and returns every row (except there is a built-in 10,000 row maximum in the response by default), and also does a right join of every row with the dictionary: | ||
2. This query returns rows for the 1,000 trips with the highest tip amount, then performs an inner join of each row with the dictionary: | ||
```sql | ||
SELECT * | ||
FROM trips | ||
|
@@ -486,14 +499,17 @@ Let's write some queries that join the `taxi_zone_dictionary` with your `trips` | |
ORDER BY tip_amount DESC | ||
LIMIT 1000 | ||
``` | ||
:::note | ||
Generally, avoid using `SELECT *` in ClickHouse; retrieve only the columns you actually need. This example purposely selects every column and returns every row so you can see what a deliberately expensive query looks like. | ||
::: | ||
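A leaner variant that keeps the join but selects only the columns of interest (a sketch; the join condition is assumed from the `LocationID` to `pickup_nyct2010_gid` mapping described earlier, as the original query's condition is elided in this diff):

```sql
-- Same join, but fetching only two columns instead of every one
SELECT
    trips.tip_amount,
    taxi_zone_dictionary.Borough
FROM trips
JOIN taxi_zone_dictionary
    ON toUInt64(trips.pickup_nyct2010_gid) = taxi_zone_dictionary.LocationID
ORDER BY tip_amount DESC
LIMIT 1000
```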
Review comment: It's not clear to me why you'd want this query to be slow.
||
|
||
|
||
#### Congrats! | ||
## Next steps | ||
|
||
Well done - you made it through the tutorial, and hopefully you have a better understanding of how to use ClickHouse. Here are some options for what to do next: | ||
Learn more about ClickHouse with the following documentation: | ||
|
||
- Read [how primary keys work in ClickHouse](./guides/best-practices/sparse-primary-indexes.md) - this knowledge will move you a long ways forward along your journey to becoming a ClickHouse expert | ||
- [Integrate an external data source](/docs/en/integrations/index.mdx) like files, Kafka, PostgreSQL, data pipelines, or lots of other data sources | ||
- [Connect your favorite UI/BI tool](./integrations/data-visualization/index.md) to ClickHouse | ||
- Check out the [SQL Reference](./sql-reference/index.md) and browse through the various functions. ClickHouse has an amazing collection of functions for transforming, processing and analyzing data | ||
- Learn more about [Dictionaries](/docs/en/sql-reference/dictionaries/index.md) | ||
- [Introduction to Primary Indexes in ClickHouse](./guides/best-practices/sparse-primary-indexes.md): Learn how ClickHouse uses sparse primary indexes to efficiently locate relevant data during queries. | ||
- [Integrate an external data source](/docs/en/integrations/index.mdx): Review data source integration options, including files, Kafka, PostgreSQL, data pipelines, and many others. | ||
- [Visualize data in ClickHouse](./integrations/data-visualization/index.md): Connect your favorite UI/BI tool to ClickHouse. | ||
- [SQL Reference](./sql-reference/index.md): Browse the SQL functions available in ClickHouse for transforming, processing and analyzing data. | ||
|
Review comment: These two giant screenshots seem extremely unnecessary and disruptive to the reader's experience.