annotated dockerfiles
Signed-off-by: Clemens Vasters <[email protected]>
clemensv committed Sep 17, 2024
1 parent f174720 commit 557a8f4
Showing 15 changed files with 4,668 additions and 1 deletion.
164 changes: 164 additions & 0 deletions gtfs/CONTAINER.md
@@ -0,0 +1,164 @@
# GTFS and GTFS-RT API bridge to Apache Kafka, Azure Event Hubs, and Fabric Event Streams

This container image provides a bridge between GTFS and GTFS-RT APIs and Apache
Kafka, Azure Event Hubs, and Fabric Event Streams. The bridge reads data from
GTFS and GTFS-RT APIs and writes data to a Kafka topic.

GTFS stands for General Transit Feed Specification. GTFS is a set of open data
standards for public transportation schedules and associated geographic
information. GTFS-RT is a real-time extension to GTFS that allows public
transportation agencies to provide real-time updates about their fleet. Over 2000
transit agencies worldwide provide GTFS and GTFS-RT data.

The [Mobility Database](https://mobilitydatabase.org) provides a comprehensive
list of GTFS and GTFS-RT feeds from around the world.

The bridge fetches GTFS and GTFS-RT data from the supplied URLs and writes the
data to a Kafka topic as [CloudEvents](https://cloudevents.io/) in a JSON format that
is documented in [EVENTS.md](EVENTS.md). One bridge instance can handle multiple
GTFS and GTFS-RT feeds, but they are managed under a single agency identifier that
must be supplied in the configuration.
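
For illustration, a single event on the Kafka topic is a CloudEvents JSON envelope roughly
like the one sketched below. The `type` value and payload shown here are placeholders; the
actual event types and payload schemas are defined in [EVENTS.md](EVENTS.md), and the
`source` attribute carries the configured agency identifier.

```json
{
  "specversion": "1.0",
  "type": "GeneralTransitFeedRealTime.VehiclePosition",
  "source": "<agency-id>",
  "id": "00000000-0000-0000-0000-000000000000",
  "time": "2024-09-17T12:00:00Z",
  "datacontenttype": "application/json",
  "data": {
    "...": "vehicle position payload as documented in EVENTS.md"
  }
}
```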

## Installing the container image

Install the image from the command line:

```shell
$ docker pull ghcr.io/clemensv/real-time-sources-gtfs:latest
```

Use it as a base image in a Dockerfile:

```Dockerfile
FROM ghcr.io/clemensv/real-time-sources-gtfs:latest
```

## Using the container image

The container image defines a single command that starts the bridge. The
bridge reads data from GTFS and GTFS-RT APIs and writes it to a Kafka topic,
Azure Event Hubs, or Fabric Event Streams.

### With a Kafka broker

The image assumes that you have a Kafka broker configured with TLS and SASL
PLAIN authentication. You can start the container locally with Docker using
the following command:

```shell
$ docker run --rm \
    -e KAFKA_BOOTSTRAP_SERVERS='<kafka-bootstrap-servers>' \
    -e KAFKA_TOPIC='<kafka-topic>' \
    -e SASL_USERNAME='<sasl-username>' \
    -e SASL_PASSWORD='<sasl-password>' \
    -e GTFS_RT_URLS='<gtfs-rt-urls>' \
    -e GTFS_URLS='<gtfs-urls>' \
    -e AGENCY='<agency-id>' \
    ghcr.io/clemensv/real-time-sources-gtfs:latest
```
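
To verify that events are flowing, you can consume from the topic with a tool such as
`kcat` (formerly kafkacat); this is a sketch using the same placeholders as above and
assumes a TLS/SASL PLAIN listener:

```shell
$ kcat -b '<kafka-bootstrap-servers>' -t '<kafka-topic>' \
    -X security.protocol=SASL_SSL \
    -X sasl.mechanism=PLAIN \
    -X sasl.username='<sasl-username>' \
    -X sasl.password='<sasl-password>' \
    -C -o end
```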

### With Azure Event Hubs or Fabric Event Streams

With Azure Event Hubs or Fabric Event Streams, you can use the connection string
to establish a connection to the service. You get the connection string from the
Azure portal, from the Azure CLI, or from the "custom endpoint" of a Fabric
Event Stream.
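
For Azure Event Hubs, one way to retrieve a namespace-level connection string is through
the Azure CLI; the resource group, namespace, and rule names below are placeholders:

```shell
$ az eventhubs namespace authorization-rule keys list \
    --resource-group '<resource-group>' \
    --namespace-name '<eventhubs-namespace>' \
    --name RootManageSharedAccessKey \
    --query primaryConnectionString --output tsv
```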

```shell
$ docker run --rm \
    -e CONNECTION_STRING='<connection-string>' \
    -e GTFS_RT_URLS='<gtfs-rt-urls>' \
    -e GTFS_URLS='<gtfs-urls>' \
    -e AGENCY='<agency-id>' \
    ghcr.io/clemensv/real-time-sources-gtfs:latest
```


### Preserve the cache directory

If you want to preserve state between restarts and avoid resending data that
has already been sent, mount a volume into the container and point the
`SCHEDULE_CACHE_DIR` environment variable at it:

```shell
$ docker run --rm \
    -v /path/to/state:/var/lib/real-time-sources-gtfs \
    ... other args... \
    -e SCHEDULE_CACHE_DIR='/var/lib/real-time-sources-gtfs' \
    ghcr.io/clemensv/real-time-sources-gtfs:latest
```

## Environment Variables

### `CONNECTION_STRING`

An Azure Event Hubs-style connection string used to establish a connection to
Azure Event Hubs or Fabric Event Streams. The connection string contains the
credentials and configuration settings required for secure and efficient data
transmission. It is used in place of `KAFKA_BOOTSTRAP_SERVERS`,
`SASL_USERNAME`, and `SASL_PASSWORD`.
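
As a rough sketch, an Event Hubs connection string that targets a specific Event Hub
(the Kafka topic equivalent) typically has the following shape; all values are placeholders:

```shell
CONNECTION_STRING='Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<key-name>;SharedAccessKey=<key>;EntityPath=<event-hub-name>'
```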

### `KAFKA_BOOTSTRAP_SERVERS`

Address of the Kafka broker. This is a comma-separated list of host:port pairs
for the brokers that the Kafka client connects to initially to bootstrap
itself, for example `broker1:9092,broker2:9092`. The client assumes that the
brokers have TLS enabled.

### `KAFKA_TOPIC`

Kafka topic to produce messages to.

### `SASL_USERNAME`

Username for SASL PLAIN authentication. The client assumes that the Kafka
brokers have been configured to support SASL PLAIN authentication.

### `SASL_PASSWORD`

Password for SASL PLAIN authentication. The client assumes that the Kafka
brokers have been configured to support SASL PLAIN authentication.

### `GTFS_RT_URLS`

Comma-separated list of GTFS-RT URLs for real-time vehicle positions and/or trip
updates and/or alerts. The supplied endpoints may support any combination of the
GTFS-RT feeds. The client fetches GTFS-RT data from all of these URLs.
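
For example, to poll vehicle positions, trip updates, and alerts from a hypothetical agency:

```shell
export GTFS_RT_URLS='https://transit.example.com/gtfs-rt/vehicle-positions.pb,https://transit.example.com/gtfs-rt/trip-updates.pb,https://transit.example.com/gtfs-rt/alerts.pb'
```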

### `GTFS_RT_HEADERS`

HTTP headers to pass to the GTFS-RT endpoint. The headers are passed as
key=value pairs separated by spaces. For example, `key1=value1 key2=value2`.
Values may be quoted to preserve spaces inside values. For example,
`key1="value with spaces"`. The client passes these headers to the GTFS-RT
endpoint when fetching GTFS-RT data.
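
For example, many GTFS-RT feeds require an API key header; the header names and values
below are illustrative only:

```shell
export GTFS_RT_HEADERS='X-Api-Key=<your-api-key> User-Agent="gtfs bridge example"'
```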

### `GTFS_URLS`

Comma-separated list of GTFS URLs for schedule and other reference data. The
client fetches GTFS data from all of these URLs.

### `GTFS_HEADERS`

HTTP headers to pass to the schedule endpoint. The headers are passed as
key=value pairs separated by spaces. For example, `key1=value1 key2=value2`.
Values may be quoted to preserve spaces inside values. For example,
`key1="value with spaces"`. The client passes these headers to the schedule
endpoint when fetching GTFS data.

### `AGENCY`

Agency identifier. The client uses this identifier to group GTFS and GTFS-RT
data under a single agency. The client includes this identifier in the CloudEvent
`source` field.

### `ROUTE`

Route filter. Reserved.

### `SCHEDULE_CACHE_DIR`

Schedule file cache directory. The client caches GTFS schedule files in this
directory and keeps track of the last modification time of the files. This
ensures that data is not acquired and sent to the Kafka broker if the schedule
files have not changed since the last fetch.
6 changes: 6 additions & 0 deletions gtfs/Dockerfile
@@ -1,6 +1,12 @@
# Use an official Python runtime as a parent image
FROM python:3.10-slim

LABEL org.opencontainers.image.source="https://github.com/clemensv/real-time-sources/tree/main/gtfs"
LABEL org.opencontainers.image.title="GTFS and GTFS-RT API bridge to Kafka endpoints"
LABEL org.opencontainers.image.description="This container is a bridge between GTFS and GTFS-RT APIs and Kafka endpoints. It fetches data from GTFS/GTFS-RT endpoints and forwards it to the configured Kafka endpoint."
LABEL org.opencontainers.image.documentation="https://github.com/clemensv/real-time-sources/blob/main/gtfs/CONTAINER.md"
LABEL org.opencontainers.image.licenses="MIT"

# Set the working directory in the container
WORKDIR /app
