diff --git a/docs/learn.md b/docs/learn.md
index f7fac22..ac166fb 100644
--- a/docs/learn.md
+++ b/docs/learn.md
@@ -1,243 +1,239 @@
 ---
 id: learn
-title: Learn Flowpipe
-sidebar_label: Learn Flowpipe
+title: Learn Tailpipe
+sidebar_label: Learn Tailpipe
 slug: /
 ---
 
-# Learn Flowpipe
+# Learn Tailpipe
 
-Flowpipe allows you to create "pipelines as code" to define workflows and other tasks that run in a sequence.
-
-## Creating your first pipeline
-
-Getting started is easy! If you haven't already done so, [download and install Flowpipe](/downloads).
-
-Flowpipe pipelines and triggers are packaged into [mods](/docs/build), and Flowpipe requires a mod to run. Let's create a new directory for our mod, and then run `flowpipe mod init` to initialize it:
+Tailpipe is a high-performance data collection and querying tool that makes it easy to collect, store, and analyze log data. With Tailpipe, you can:
+
+- Collect logs from various sources and store them efficiently in parquet files
+- Query your data using familiar SQL syntax through DuckDB
+- Share collected data with your team using remote object storage
+- Create filtered views of your data using schemas
+- Join log data with other data sources for enriched analysis
+
+## Install the NGINX Plugin
+
+This tutorial uses the NGINX plugin to demonstrate collecting and analyzing web server access logs. First, [download and install Tailpipe](/downloads), and then install the plugin:
 
 ```bash
-mkdir learn_flowpipe
-cd learn_flowpipe
-flowpipe mod init
+tailpipe plugin install nginx
 ```
 
-The `flowpipe mod init` command creates a file named `mod.fp` in the directory. This file contains a `mod` definition for our new mod:
-
-```hcl
-mod "local" {
-  title = "learn_flowpipe"
-}
-```
-
-You can customize the [mod definition](/docs/flowpipe-hcl/mod) if you like, but the default is sufficient for our purposes.
-
-Let's create our first pipeline.
-
-Flowpipe mods are written in HCL. When Flowpipe runs, it will load the mod from the working directory and will read all files with the `.fp` extension from the directory and its subdirectories recursively. Create a file named `learn.fp` and add the following code:
+## Configure Data Collection
+
+Tailpipe uses HCL configuration files to define what data to collect. Create a file named `nginx.tpc` with the following content:
 
 ```hcl
-pipeline "learn_flowpipe" {
-  step "http" "get_ipv4" {
-    url = "https://api.ipify.org?format=json"
-  }
-
-  output "ip_address" {
-    value = step.http.get_ipv4.response_body.ip
-  }
+partition "nginx_access_log" "web_servers" {
+  plugin = "nginx"
+  source "nginx_access_log_file" {
+    log_path = "/var/log/nginx/access.log"
+  }
 }
 ```
 
-A Flowpipe [pipeline](/docs/flowpipe-hcl/step/pipeline) is a sequence of steps to do work. This snippet creates a pipeline called `learn_flowpipe` that has a single [http step](/docs/flowpipe-hcl/step/http), and a single [output](/docs/flowpipe-hcl/step/pipeline#outputs).
-
-Let's run it!
-
-```bash
-flowpipe pipeline run learn_flowpipe
-```
-
-![](/images/docs/learn/get-ipv4.png)
-Flowpipe runs the pipeline and prints its outputs once it is complete.
+This configuration tells Tailpipe to collect NGINX access logs from the specified log file. The configuration defines:
+
+- A partition named "web_servers" for the "nginx_access_log" table
+- The source type "nginx_access_log_file" which reads NGINX formatted logs
+- The path to the log file to collect from
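+
+A `.tpc` file can define more than one partition. As a sketch (the `staging_servers` partition name and its log path are hypothetical), you could collect a second set of logs into the same table by adding another partition block:
+
+```hcl
+# Hypothetical second partition for the same nginx_access_log table;
+# adjust log_path to wherever those servers write their access logs.
+partition "nginx_access_log" "staging_servers" {
+  plugin = "nginx"
+  source "nginx_access_log_file" {
+    log_path = "/var/log/nginx/staging-access.log"
+  }
+}
+```
+
+You would collect it the same way, with `tailpipe collect nginx_access_log.staging_servers`.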
 
-When troubleshooting, it's often useful to view more information about the currently executing steps. You can use the `--verbose` flag to show this detailed information.
-
-```bash
-flowpipe pipeline run learn_flowpipe --verbose
-```
-
-![](/images/docs/learn/get-ipv4-verbose.png)
-
-## Using mods
-
-Flowpipe's modular design allows you to build pipelines from other pipelines. Let's install the `reallyfreegeoip` mod:
-
-```bash
-flowpipe mod install github.com/turbot/flowpipe-mod-reallyfreegeoip
-```
-
-```bash
-Installed 1 mod:
-
-local
-└── github.com/turbot/flowpipe-mod-reallyfreegeoip@v0.1.0
-```
+## Collect Data
+
+Now let's collect the logs:
+
+```bash
+tailpipe collect nginx_access_log.web_servers
+```
+
+This command will:
+
+1. Read the NGINX access logs from the specified file
+2. Parse and standardize the log entries
+3. Store the data in parquet files organized by date
+4. Update the local database with table definitions
 
-The mod is installed into the `.flowpipe/mods` subdirectory, and a dependency is added to your `mod.fp`.
-
-Now that the mod is installed, you should see its pipelines:
-
-```bash
-flowpipe pipeline list
-```
-
-```bash
-MOD              NAME                                          DESCRIPTION
-local            learn_flowpipe
-reallyfreegeoip  reallyfreegeoip.pipeline.get_ip_geolocation  Get geolocation data for an IPv4 or IPv6 address.
-```
-
-You can run pipelines from the dependency mod on the command line:
-
-```bash
-flowpipe pipeline run reallyfreegeoip.pipeline.get_ip_geolocation --arg ip_address=35.236.238.30
-```
+## Query Your Data
+
+Tailpipe provides an interactive SQL shell for analyzing your collected data. Let's look at some examples of what you can do.
+
+### Analyze Traffic by Server
+
+This query shows a summary of traffic for each server for a specific date:
+
+```sql
+SELECT
+    tp_index as server,
+    count(*) as requests,
+    count(distinct remote_addr) as unique_ips,
+    round(avg(bytes_sent)) as avg_bytes,
+    count(CASE WHEN status = 200 THEN 1 END) as success_count,
+    count(CASE WHEN status >= 500 THEN 1 END) as error_count,
+    round(avg(CASE WHEN method = 'GET' THEN bytes_sent END)) as avg_get_bytes
+FROM nginx_access_log
+WHERE tp_date = '2024-11-01'
+GROUP BY tp_index
+ORDER BY requests DESC;
+```
+
+```
+┌──────────────────────────────────────────────────────────────────────────────────────┐
+│ server      requests   unique_ips   avg_bytes   success_c…   error_cou…   avg_get_b… │
+│──────────────────────────────────────────────────────────────────────────────────────│
+│ web-01.ex…       349          346        7036          267            7         7158 │
+│ web-02.ex…       327          327        6792          246           11         6815 │
+│ web-03.ex…       324          322        7001          254            8         6855 │
+└──────────────────────────────────────────────────────────────────────────────────────┘
+```
+
+This shows us:
+
+- Number of requests per server
+- Count of unique IP addresses
+- Average response size
+- Success and error counts
+- Average size of GET requests
+
+### Time-Oriented Query
+
+Let's look at some recent log entries:
+
+```sql
+SELECT
+    tp_date,
+    tp_index as server,
+    remote_addr as ip,
+    method,
+    uri,
+    status,
+    bytes_sent
+FROM nginx_access_log
+WHERE tp_date = '2024-11-01'
+LIMIT 10;
+```
 
-![](/images/docs/learn/reallyfreegeoip.png)
-
-## Composing with pipelines
-
-While running the dependency pipelines directly in the CLI is useful, the real power is the ability to compose pipelines from other pipelines. Let's add a [pipeline step](/docs/flowpipe-hcl/step/pipeline) to take our IP address and look up our geo-location information.
-
-```hcl
-pipeline "learn_flowpipe" {
-  step "http" "get_ipv4" {
-    url = "https://api.ipify.org?format=json"
-  }
-
-  step "pipeline" "get_geo" {
-    pipeline = reallyfreegeoip.pipeline.get_ip_geolocation
-    args = {
-      ip_address = step.http.get_ipv4.response_body.ip
-    }
-  }
-
-  output "ip_address" {
-    value = step.http.get_ipv4.response_body.ip
-  }
-
-  output "latitude" {
-    value = step.pipeline.get_geo.output.geolocation.latitude
-  }
-
-  output "longitude" {
-    value = step.pipeline.get_geo.output.geolocation.longitude
-  }
-}
-```
-
-Notice that we used the IP address from the first step (`step.http.get_ipv4.response_body.ip`) as an argument to the second step. Flowpipe automatically detects this dependency and runs the steps in the correct order!
+```
+┌──────────────────────────────────────────────────────────────────────────────────────┐
+│ tp_date     server          ip            method   uri            status  bytes_sent │
+│──────────────────────────────────────────────────────────────────────────────────────│
+│ 2024-11-01  web-01.example  220.50.48.32  GET      /profile/user  200     5704       │
+│ 2024-11-01  web-01.example  10.166.12.45  GET      /blog/post/1   200     2341       │
+│ 2024-11-01  web-01.example  203.0.113.10  GET      /dashboard     200     11229      │
+│ 2024-11-01  web-01.example  45.211.16.72  PUT      /favicon.ico   301     2770       │
+│ 2024-11-01  web-01.example  66.171.35.91  POST     /static/main   503     5928       │
+│ 2024-11-01  web-01.example  64.152.79.83  GET      /logout        200     3436       │
+│ 2024-11-01  web-01.example  156.25.84.12  GET      /static/main   200     12490      │
+│ 2024-11-01  web-01.example  78.131.22.45  GET      /static/main   200     8342       │
+│ 2024-11-01  web-01.example  203.0.113.10  POST     /api/v1/user   200     3123       │
+│ 2024-11-01  web-01.example  10.74.127.93  POST     /              200     7210       │
+└──────────────────────────────────────────────────────────────────────────────────────┘
+```
+
+Because we specified `tp_date = '2024-11-01'`, Tailpipe only needs to read the parquet files in the corresponding date directories. Similarly, if you wanted to analyze traffic for a specific server, you could add `tp_index = 'web-01.example.com'` to your WHERE clause, and Tailpipe would only read files from that server's directory.
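+
+For example, you can combine both predicates. This sketch reuses the sample server name from above; only the files under that server's 2024-11-01 directory are read:
+
+```sql
+-- Partition pruning: only parquet files under
+-- tp_index=web-01.example.com/tp_date=2024-11-01 are scanned.
+SELECT
+    method,
+    count(*) as requests,
+    round(avg(bytes_sent)) as avg_bytes
+FROM nginx_access_log
+WHERE tp_date = '2024-11-01'
+  AND tp_index = 'web-01.example.com'
+GROUP BY method
+ORDER BY requests DESC;
+```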
"fahrenheit" : "celsius"}" - ]) - } - - step "transform" "friendly_forecast" { - value = join("", [ - "It is currently ", - step.http.get_weather.response_body.current.temperature, - step.http.get_weather.response_body.current_units.temperature, - ", with a high of ", - step.http.get_weather.response_body.daily.temperature_2m_max[0], - step.http.get_weather.response_body.daily_units.temperature_2m_max, - " and a low of ", - step.http.get_weather.response_body.daily.temperature_2m_min[0], - step.http.get_weather.response_body.daily_units.temperature_2m_min, - ". There is a ", - step.http.get_weather.response_body.daily.precipitation_probability_mean[0], - step.http.get_weather.response_body.daily_units.precipitation_probability_mean, - " chance of precipitation." - ]) - } - - output "ip_address" { - value = step.http.get_ipv4.response_body.ip - } - - output "latitude" { - value = step.pipeline.get_geo.output.geolocation.latitude - } - - output "longitude" { - value = step.pipeline.get_geo.output.geolocation.longitude - } - - output "forecast" { - value = step.transform.friendly_forecast.value - } -} +``` ++-- default + +-- nginx_access_log + ¦ +-- tp_partition=nginx_access_log + ¦ +-- tp_index=web-01.example.com + ¦ ¦ +-- tp_date=2024-11-01 + ¦ ¦ +-- file_a7d40b4a-0398-46c6-8869-dc5dd87015a0.parquet + ¦ +-- tp_index=web-02.example.com + ¦ ¦ +-- tp_date=2024-11-01 + ¦ ¦ +-- file_696946fa-f636-4b54-a8ec-82f64704ff50.parquet + ¦ +-- tp_index=web-03.example.com + ¦ +-- tp_date=2024-11-01 + ¦ +-- file_a061d992-eb86-46a5-bc8f-d1a4b2fcce25.parquet + +-- pipes_audit_log + ¦ +-- tp_partition=pipes_audit_log + ¦ +-- tp_index=turbot-ops + ¦ +-- tp_date=2024-11-05 + ¦ +-- file_ebab33de-2dcc-437c-8722-e371316f0b22.parquet + +-- tailpipe.db ``` -![](/images/docs/learn/weather-report.png) - +The structure has several key components: +- **Partition**: Groups data by source (e.g., `nginx_access_log`) +- **Index**: Sub-divides data by a meaningful key (e.g., server name for NGINX logs) +- **Date**: Further partitions data by date +- Each partition contains parquet files with the actual log data +This hierarchical structure enables efficient querying through partition pruning. When you query with conditions on `tp_partition`, `tp_index`, or `tp_date`, Tailpipe (and DuckDB) can skip reading irrelevant parquet files entirely. -## Send a message -Now we have a pipeline that can get the local forecast - let's send it somewhere! The [message step](/docs/flowpipe-hcl/step/message) provides a mechanism for sending messages via multiple communication channels, such as Slack and Email. +### Using DuckDB Directly -Add this step to the `learn_flowpipe` pipeline. +Since Tailpipe stores data in standard parquet files using a hive partitioning scheme, you can query the data directly with DuckDB: -```hcl - step "message" "send_forecast" { - notifier = notifier.default - subject = "Todays Forecast" - text = step.transform.friendly_forecast.value - } +```sql +$ cd ~/.tailpipe/data/default/tailpipe.db +$ duckdb tailpipe.db +D SELECT + tp_date, + tp_index as server, + count(*) as requests + FROM nginx_access_log + WHERE tp_date = '2024-11-01' + GROUP BY tp_date, tp_index; ``` -And run the pipeline again. 
-
-```bash
-flowpipe pipeline run learn_flowpipe
-```
-
-You should see the message printed to the console when you run the pipeline.
-
-Console messages and inputs are useful, but Flowpipe can also route these input requests, approvals and notifications to external systems like Slack, MS Teams, and Email!
-
-Flowpipe [Integrations](/docs/reference/config-files/integration) allow you to interface with external systems. [Notifiers](/docs/reference/config-files/notifier) allow you to route [message](/docs/flowpipe-hcl/step/message) and [input](/docs/build/input) steps to one or more integrations. Integrations are only loaded in [server-mode](/docs/run/server).
-
-Flowpipe server creates a default [`http` integration](/docs/reference/config-files/integration/http) as well as a [default notifier](/docs/reference/config-files/notifier#default-notifier) that routes to it, but you can send it via [Email](/docs/reference/config-files/integration/email), [Slack](/docs/reference/config-files/integration/slack) or [Microsoft Teams](/docs/reference/config-files/integration/msteams) without modifying the pipeline code. Just create the appropriate [integrations](/docs/reference/config-files/integration), add them to the [default notifier](/docs/reference/config-files/notifier#default-notifier), and run the pipeline again from a server instance!
-
-```bash
-flowpipe server &
-flowpipe pipeline run learn_flowpipe --host local
-```
-
-![](/images/docs/learn/slack-weather-report.png)
+
+This flexibility means you can:
+
+- Use your favorite DuckDB client to analyze the data
+- Write scripts that process the data directly
+- Import the data into other tools that support parquet files
+- Build automated reporting systems around the collected data
+
+## Join with External Data
+
+One of Tailpipe's powerful features is the ability to join log data with other tables. Here's an example joining with an IP information table (a hypothetical `ip_info` table with `ip_address` and `description` columns) to get more context about the traffic:
+
+```sql
+SELECT
+    n.remote_addr as ip,
+    i.description,
+    count(*) as requests,
+    count(distinct n.server_name) as servers_accessed,
+    round(avg(n.bytes_sent)) as avg_bytes,
+    string_agg(distinct n.method, ', ') as methods_used,
+    count(CASE WHEN n.status >= 400 THEN 1 END) as errors
+FROM nginx_access_log n
+LEFT JOIN ip_info i ON n.remote_addr = i.ip_address
+WHERE i.description IS NOT NULL
+GROUP BY n.remote_addr, i.description
+ORDER BY requests DESC;
+```
+
+```
+┌──────────────────────────────────────────────────────────────────────────────────────┐
+│ ip          descripti…   requests   servers_a…   avg_bytes   methods_u…   errors     │
+│──────────────────────────────────────────────────────────────────────────────────────│
+│ 203.0.113…  Test Netw…          1            1        1860   GET               0     │
+└──────────────────────────────────────────────────────────────────────────────────────┘
+```
+
+This enriched query shows:
+
+- IP addresses and their descriptions
+- How many servers each IP accessed
+- Average response sizes
+- HTTP methods used
+- Error counts
+
+## What's Next?
+
+We've demonstrated basic log collection and analysis with Tailpipe. Here's what to explore next:
+
+- [Discover more plugins on the Hub →](https://hub.steampipe.io/plugins)
+- [Learn about data compaction and optimization →](https://tailpipe.io/docs/managing/compaction)
+- [Share data with your team using remotes →](https://tailpipe.io/docs/sharing/remotes)
+- [Create schemas for filtered views →](https://tailpipe.io/docs/schemas)
+- [Join #tailpipe on Slack →](https://turbot.com/community/join)