From 7e3a8ad38ec631feeadbbbd4f488bc1d5306828a Mon Sep 17 00:00:00 2001
From: Clemens Vasters
Date: Thu, 19 Sep 2024 09:32:35 +0200
Subject: [PATCH] doc and build script updates

Signed-off-by: Clemens Vasters
---
 README.md                           |   9 +-
 gtfs/README.md                      | 165 +++++++++++++++++-----------
 nextbus/nextbus/__init__.py         |   4 +-
 nextbus/pyproject.toml              |   5 +-
 noaa/README.md                      | 119 ++++++++++++--------
 noaa/pyproject.toml                 |   2 +-
 pegelonline/README.md               | 112 +++++++++++++------
 pegelonline/pegelonline/__init__.py |   4 +-
 pegelonline/pyproject.toml          |   2 +-
 rss/README.md                       | 165 +++++++++++++++++++++++++++-
 rss/pyproject.toml                  |   2 +-
 rss/rssbridge/__init__.py           |   4 +-
 rss/rssbridge/rssbridge.py          |   7 +-
 tools/generate-events-md.ps1        |  21 +++-
 tools/install-avrotize.ps1          |  20 ++++
 tools/printdoc.py                   |  21 ++++
 tools/run-kql-script.ps1            |  25 +++++
 17 files changed, 521 insertions(+), 166 deletions(-)

diff --git a/README.md b/README.md
index c837226..dd474ab 100644
--- a/README.md
+++ b/README.md
@@ -64,4 +64,11 @@ The [Pegelonline data poller](pegelonline/README.md) is a command line tool
 that can be used to retrieve real-time water level and current data from the
 German Federal Waterways and Shipping Administration (WSV) Pegelonline API. The
 data is available for over 3000 stations in Germany. The Pegelonline data is updated
-every 15 minutes, and the data volume is relatively low.
\ No newline at end of file
+every 15 minutes, and the data volume is relatively low.
+
+### Nextbus - Public transport data
+
+The [Nextbus tool](nextbus/README.md) is a command line tool that can be used to
+retrieve real-time data from the [Nextbus](https://www.nextbus.com/) service and
+feed that data into Azure Event Hubs and Microsoft Fabric Event Streams. The tool
+can also be used to query the Nextbus service interactively.
\ No newline at end of file
diff --git a/gtfs/README.md b/gtfs/README.md
index a3c4348..7f4f2f9 100644
--- a/gtfs/README.md
+++ b/gtfs/README.md
@@ -1,16 +1,35 @@
-# GTFS Real Time CLI tool
+# GTFS and GTFS-RT API Bridge Usage Guide
 
-The GTFS real time tool is a command line tool that can be used to retrieve real time data from GTFS Real Time endpoints and feed that data into Azure Evnt Hubs and Microsft Fabric Event Streams.
+## Overview
 
-GTFS Real Time is a standard for exchanging real time data about public transit systems. The standard is maintained by Google and is described [here](https://developers.google.com/transit/gtfs-realtime/).
+**GTFS and GTFS-RT API Bridge** is a tool that fetches GTFS (General Transit Feed Specification) Realtime and Static data from various transit agency sources, processes the data, and publishes it to Kafka topics using SASL PLAIN authentication. This tool can be integrated with systems like Microsoft Event Hubs or Microsoft Fabric Event Streams.
 
-## Prerequisites
+GTFS is a set of open data standards for public transportation schedules and
+associated geographic information. GTFS-RT is a real-time extension to GTFS that
+allows public transportation agencies to provide real-time updates about their
+fleet. Over 2000 transit agencies worldwide provide GTFS and GTFS-RT data.
 
-The tool is written in Python and requires Python 3.10 or later. You can download Python from [here](https://www.python.org/downloads/). You also need to install the `git` command line tool. You can download `git` from [here](https://git-scm.com/downloads).
+The [Mobility Database](https://mobilitydatabase.org) provides a comprehensive list of GTFS
+and GTFS-RT feeds from around the world.
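+
+GTFS-RT feeds are served as Protocol Buffers messages. The bridge fetches and
+decodes them for you, but for illustration, here is a minimal sketch of reading
+a vehicle-positions feed with the `gtfs-realtime-bindings` and `requests`
+packages. The feed URL is a placeholder; substitute a real endpoint, for
+example one listed in the Mobility Database.
+
+```python
+# pip install gtfs-realtime-bindings requests
+import requests
+from google.transit import gtfs_realtime_pb2
+
+FEED_URL = "https://example.org/gtfs-rt/vehicle-positions.pb"  # placeholder URL
+
+response = requests.get(FEED_URL, timeout=30)
+response.raise_for_status()
+
+# Decode the protobuf payload into a FeedMessage.
+feed = gtfs_realtime_pb2.FeedMessage()
+feed.ParseFromString(response.content)
+
+# Print the vehicle id and position of each vehicle entity in the feed.
+for entity in feed.entity:
+    if entity.HasField("vehicle"):
+        vp = entity.vehicle
+        print(vp.vehicle.id, vp.position.latitude, vp.position.longitude)
+```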
+ +## Key Features: +- **GTFS-RT Data Polling**: Poll GTFS Realtime feeds for vehicle positions, trip updates, and alerts. +- **GTFS Static Data Processing**: Fetch GTFS static data (routes, stops, schedules) and send it to Kafka topics. +- **Kafka Integration**: Supports sending data to Kafka topics using SASL PLAIN authentication. ## Installation -Install the tool from the command line as follows: +The tool is written in Python and requires Python 3.10 or later. You can download Python from [here](https://www.python.org/downloads/) or from the Microsoft Store if you are on Windows. + +### Installation Steps + +Once Python is installed, you can install the tool from the command line as follows: + +```bash +pip install git+https://github.com/clemensv/real-time-sources#subdirectory=gtfs +``` + +If you clone the repository, you can install the tool as follows: ```bash git clone https://github.com/clemensv/real-time-sources.git @@ -18,91 +37,103 @@ cd real-time-sources/gtfs pip install . ``` -A package install will be available later. +For a packaged install, consider using the [CONTAINER.md](CONTAINER.md) instructions. -## Usage +## How to Use -```bash -options: - -h, --help show this help message and exit - -subcommands: - {feed,vehicle-locations} - feed poll vehicle locations and submit to an Event Hub - vehicle-locations get the vehicle locations for a route - -``` +After installation, the tool can be run using the `gtfs` command. It supports several arguments for configuring the polling process and sending data to Kafka. +The events sent to Kafka are formatted as CloudEvents, documented in [EVENTS.md](EVENTS.md). -### Vehicle Locations +### `feed` Command-Line Arguments -This command returns the vehicle locations from the feed. +- `--kafka-bootstrap-servers`: Comma-separated list of Kafka bootstrap servers. +- `--kafka-topic`: The Kafka topic to send messages to. +- `--sasl-username`: Username for SASL PLAIN authentication. +- `--sasl-password`: Password for SASL PLAIN authentication. +- `--connection-string`: Microsoft Event Hubs or Microsoft Fabric Event Stream connection string (overrides other Kafka parameters). +- `--gtfs-rt-urls`: URL(s) for GTFS Realtime feeds. +- `--gtfs-urls`: URL(s) for GTFS Static schedule feeds. +- `--mdb-source-id`: Mobility Database source ID for GTFS Realtime or Static feeds. +- `--agency`: Agency ID to poll data for. +- `--route`: (Optional) Route ID to poll data for. If not provided, data for all routes will be polled. +- `--poll-interval`: Interval in seconds to wait between polling vehicle locations. +- `--force-schedule-refresh`: Force a refresh of the GTFS schedule data. -* `--agency `: the name of the agency (required) -* `--gtfs-url `: the URL of the agency's GTFS real-time endpoint (required)` -* `--header `: extra HTTP header to use for the request to the GTFS real-time endpoint. Can be specified multiple times. (optional) +### Example Usage +#### Poll GTFS-RT and Send Data to Kafka ```bash -gtfs-cli vehicle-locations --agency --gtfs-url [--header ] +gtfs feed --connection-string "" ``` -The output is a list of vehicle locations, one per line. 
Each line contains the following information: +#### Poll a Specific Route for Vehicle Data +```bash +gtfs feed --connection-string "" --route "" +``` -* the vehicle id -* the vehicle location as a latitude/longitude pair -* the vehicle heading in degrees -* the vehicle speed in km/h -* a link to a map showing the vehicle location +#### Using Kafka Parameters Directly +If you do not want to use a connection string, you can provide the Kafka parameters directly: +```bash +gtfs feed --kafka-bootstrap-servers "" --kafka-topic "" --sasl-username "" --sasl-password "" +``` -### Feed +### Connection String for Microsoft Event Hubs or Fabric Event Streams -This command polls the GTFS real-time endpoint for vehicle locations and submits them to an Azure Event Hub -instance or to a Microsoft Fabric Event Stream. The command requires the following parameters: +You can provide a **connection string** for Microsoft Event Hubs or Microsoft Fabric Event Streams to simplify the configuration by consolidating the Kafka bootstrap server, topic, username, and password. -* `--agency `: the name of the agency (required) -* `--gtfs-url `: the URL of the agency's GTFS real-time endpoint (required)` -* `--header `: extra HTTP header to use for the request to the GTFS real-time endpoint. Can be specified multiple times. (optional) -* `--feed-connection-string `: the connection string for the Azure Event Hub instance that receives the - vehicle locations. The connection string must include the `Send` policy. (required) -* `--feed-event-hub-name `: the name of the Event Hub instance that receives the vehicle locations (required) -* `--poll-interval `: the interval in seconds between polls of the GTFS feed service. The default is 10 seconds. (optional) +#### Format: +``` +Endpoint=sb://.servicebus.windows.net/;SharedAccessKeyName=;SharedAccessKey=;EntityPath= +``` + +### Additional Commands -The connection information for the Event Hub instances can be found in the Azure portal. The connection string -is available in the "Shared access policies" section of the Event Hub instance. The Event Hub name is the name -of the Event Hub instance. +#### Print GTFS Realtime Feed Data +Prints the GTFS-RT data for a single request: + +```bash +gtfs printfeed --gtfs-rt-url "" +``` -The feed command will run until interrupted with `Ctrl-C`. +#### List Agencies +Lists the agencies in the Mobility Database: ```bash -gtfs-cli feed --agency --route --feed-connection-string --feed-event-hub-name +gtfs agencies ``` -### "Feed" Event Hub output +#### List Routes +Lists the routes from a GTFS Static feed: -The output into the "feed" Event Hub are CloudEvent messages with the `type` attribute -set to `gtfs.vehiclePosition`. The `subject` attribute is set to `{agency_tag}/{vehicle_id}`. 
+```bash +gtfs routes --gtfs-url "" +``` + +#### List Stops +Lists the stops for a given route: -#### gtfs.vehiclePosition +```bash +gtfs stops --route "" --gtfs-url "" +``` -The `data`of the CloudEvent message is a JSON object with the following attributes: +## Environment Variables -* `agency`: the agency tag -* `routeTag`: the route tag -* `dirTag`: the direction tag -* `id`: the vehicle id -* `lat`: the vehicle location latitude -* `lon`: the vehicle location as a longitude -* `predictable`: whether the vehicle location is predictable -* `heading`: the vehicle heading in degrees -* `speedKmHr`: the vehicle speed in km/h -* `timestamp`: the timestamp of the vehicle location +You can avoid passing parameters via the command line by setting the following environment variables: +- `KAFKA_BOOTSTRAP_SERVERS`: List of Kafka bootstrap servers. +- `KAFKA_TOPIC`: Kafka topic to send messages to. +- `SASL_USERNAME`: Username for SASL PLAIN authentication. +- `SASL_PASSWORD`: Password for SASL PLAIN authentication. +- `CONNECTION_STRING`: Microsoft Event Hubs or Microsoft Fabric Event Stream connection string. +- `GTFS_RT_URLS`: Comma-separated list of GTFS Realtime feed URLs. +- `GTFS_URLS`: Comma-separated list of GTFS Static schedule feed URLs. +- `MDB_SOURCE_ID`: Mobility Database source ID for the GTFS feed. +- `AGENCY`: Agency ID to poll data for. -## Public GTFS Real time feeds with vehicle positions +### CloudEvents Mode +You can specify the CloudEvents mode (either `structured` or `binary`) when sending data to Kafka: -| Agency | URL | Documentation | -|--------|-----|---------------| -| New York City MTA Bus Time, US | http://gtfsrt.prod.obanyc.com/vehiclePositions?key={key} | https://bustime.mta.info/wiki/Developers/GTFSRt | -| Catalunya FGC, ES |https://fgc.opendatasoft.com/explore/dataset/vehicle-positions-gtfs_realtime/files/d286964db2d107ecdb1344bf02f7b27b/download/ | https://data.europa.eu/data/datasets/https-analisi-transparenciacatalunya-cat-api-views-y6iv-pycv?locale=en | -| Brest métropole, FR | https://www.data.gouv.fr/fr/datasets/r/d5d43e1e-af62-4811-8a4e-ca14ad4209c8 | https://data.europa.eu/data/datasets/55ffbe0888ee387348ccb97d?locale=en | -| ALEOP (regional transport in Pays de la Loire), FR | https://www.data.gouv.fr/fr/datasets/r/b78c6d8a-3145-4deb-b68a-9b6fc9af7a89 | https://data.europa.eu/data/datasets/632b2c56696ec36c7f4811c8?locale=en | +```bash +gtfs feed --cloudevents-mode structured +``` \ No newline at end of file diff --git a/nextbus/nextbus/__init__.py b/nextbus/nextbus/__init__.py index 41bcfac..7bfdce4 100644 --- a/nextbus/nextbus/__init__.py +++ b/nextbus/nextbus/__init__.py @@ -1,5 +1,5 @@ # __init.py__ -from . 
import nextbus +from .nextbus import main if __name__ == "__main__": - nextbus.main() + main() diff --git a/nextbus/pyproject.toml b/nextbus/pyproject.toml index beda1da..6cdcd6d 100644 --- a/nextbus/pyproject.toml +++ b/nextbus/pyproject.toml @@ -7,6 +7,7 @@ name = "nextbus" version = "0.1.0" description = "A project to fetch data from NextBus API" authors = ["Clemens Vasters "] +readme = "README.md" [tool.poetry.dependencies] python = "^3.10" @@ -14,5 +15,5 @@ requests = "^2.31.0" azure-eventhub = "^5.11.3" cloudevents = "^1.9.0" -[build-system.scripts] -nextbus-cli = "nextbus.nextbus:main" +[tool.poetry.scripts] +nextbus = "nextbus:main" diff --git a/noaa/README.md b/noaa/README.md index 73aec67..1a82646 100644 --- a/noaa/README.md +++ b/noaa/README.md @@ -1,80 +1,105 @@ -# NOAA Data Poller +# NOAA Data Poller Usage Guide -The NOAA Data Poller is a tool designed to periodically fetch data from NOAA (National Oceanic and Atmospheric Administration) and send it to a specified Apache Kafka topic using SASL PLAIN authentication. +## Overview -## Features +**NOAA Data Poller** is a tool designed to interact with the NOAA (National Oceanic and Atmospheric Administration) API to fetch real-time environmental data from various NOAA stations. The tool can retrieve data such as water levels, air temperature, wind, and predictions, and send this data to a Kafka topic using SASL PLAIN authentication, making it suitable for integration with systems like Microsoft Event Hubs or Microsoft Fabric Event Streams. -- Polls various NOAA data products including water level, air temperature, wind, air pressure, water temperature, and more. -- Sends the data to a Kafka topic in the form of CloudEvents. -- Uses SASL PLAIN authentication for Kafka communication. -- Stores the last polled times for each station and product to avoid duplicate data fetching. +## Key Features: +- **NOAA Data Polling**: Retrieve data for various NOAA products, including water levels, predictions, air temperature, wind, and more. +- **Station Support**: Poll data for all NOAA stations or specify a single station. +- **Kafka Integration**: Send NOAA data to a Kafka topic using SASL PLAIN authentication. -## Requirements +## Installation -- Python 3.8+ +The tool is written in Python and requires Python 3.10 or later. You can download Python from [here](https://www.python.org/downloads/) or from the Microsoft Store if you are on Windows. -## Installation +### Installation Steps -Install the required Python packages using pip: +Once Python is installed, you can install the tool from the command line as follows: -```sh -pip install requests confluent_kafka cloudevents +```bash +pip install git+https://github.com/clemensv/real-time-sources#subdirectory=noaa ``` -## Usage +If you clone the repository, you can install the tool as follows: -The NOAA Data Poller can be run from the command line. Below are the available command-line arguments: - -```sh -python noaa_data_poller.py --last-polled-file LAST_POLLED_FILE --kafka-bootstrap-servers KAFKA_BOOTSTRAP_SERVERS --kafka-topic KAFKA_TOPIC --sasl-username SASL_USERNAME --sasl-password SASL_PASSWORD --connection-string CONNECTION_STRING +```bash +git clone https://github.com/clemensv/real-time-sources.git +cd real-time-sources/noaa +pip install . ``` -### Arguments +For a packaged install, consider using the [CONTAINER.md](CONTAINER.md) instructions. + +## How to Use + +After installation, the tool can be run using the `noaa` command. 
It supports several arguments for configuring the polling process and sending data to Kafka. + +The events sent to Kafka are formatted as CloudEvents, documented in [EVENTS.md](EVENTS.md). + +### Command-Line Arguments -- `--last-polled-file`: File to store the last polled times for each station and product. Default is `~/.noaa_last_polled.json`. +- `--last-polled-file`: Path to the file where the last polled times for each station and product are stored. Defaults to `~/.noaa_last_polled.json`. - `--kafka-bootstrap-servers`: Comma-separated list of Kafka bootstrap servers. -- `--kafka-topic`: Kafka topic to send messages to. +- `--kafka-topic`: The Kafka topic to send messages to. - `--sasl-username`: Username for SASL PLAIN authentication. - `--sasl-password`: Password for SASL PLAIN authentication. -- `--connection-string`: Microsoft Azure Event Hubs or Microsoft Fabric Event Streams connection string. +- `--connection-string`: Microsoft Event Hubs or Microsoft Fabric Event Stream connection string (overrides other Kafka parameters). +- `--station`: (Optional) Station ID to poll data for. If not provided, data for all stations will be polled. -### Example +### Example Usage -```sh -python noaa_data_poller.py --last-polled-file ~/.noaa_last_polled.json --kafka-bootstrap-servers your.kafka.server:9093 --kafka-topic noaa-data --sasl-username your_username --sasl-password your_password +#### Poll All Stations and Send Data to Kafka +```bash +noaa --connection-string "" ``` -## Environment Variables - -The tool can also be configured using environment variables as an alternative to command-line arguments. - -- `NOAA_LAST_POLLED_FILE` -- `KAFKA_BOOTSTRAP_SERVERS` -- `KAFKA_TOPIC` -- `SASL_USERNAME` -- `SASL_PASSWORD` -- `CONNECTION_STRING` - -## Logging and Error Handling +#### Poll a Specific Station and Send Data to Kafka +```bash +noaa --connection-string "" --station "" +``` -The tool logs progress and errors to the console. Ensure proper monitoring of logs for troubleshooting and maintenance. +#### Using Kafka Parameters Directly +If you do not want to use a connection string, you can provide the Kafka parameters directly: +```bash +noaa --kafka-bootstrap-servers "" --kafka-topic "" --sasl-username "" --sasl-password "" +``` -## Deploying as a Container to Azure Container Instances +### Connection String for Microsoft Event Hubs or Fabric Event Streams -The NOAA Data Poller can be deployed as a container to Azure Container Instances. +The tool supports providing a **connection string** for Microsoft Event Hubs or Microsoft Fabric Event Streams. This connection string simplifies the configuration by consolidating the Kafka bootstrap server, topic, username, and password. -[![Deploy to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fclemensv%2Freal-time-sources%2Fmain%2Fnoaa%2Fazure-template.json) +#### Format: +``` +Endpoint=sb://.servicebus.windows.net/;SharedAccessKeyName=;SharedAccessKey=;EntityPath= +``` -## Contributing +When provided, the connection string is parsed to extract the following details: +- **Bootstrap Servers**: Derived from the `Endpoint` value. +- **Kafka Topic**: Derived from the `EntityPath` value. +- **SASL Username and Password**: The username is set to `'$ConnectionString'`, and the password is the entire connection string. -Contributions are welcome. Please fork the repository and submit pull requests. 
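+
+For illustration, the mapping described above can be reproduced in a few lines
+of Python. This is a minimal sketch of the documented rules, not the tool's
+actual implementation; the namespace, key, and topic values in the example are
+placeholders, and the resulting settings are what you would otherwise pass as
+`--kafka-bootstrap-servers`, `--kafka-topic`, `--sasl-username`, and
+`--sasl-password`.
+
+```python
+def kafka_settings_from_connection_string(connection_string: str) -> dict:
+    """Derive Kafka SASL PLAIN settings from an Event Hubs-style connection string."""
+    # Split "key=value" pairs separated by semicolons; values may themselves contain '='.
+    parts = dict(
+        segment.split("=", 1)
+        for segment in connection_string.split(";")
+        if "=" in segment
+    )
+    # Endpoint looks like "sb://<namespace>.servicebus.windows.net/".
+    namespace = parts["Endpoint"].removeprefix("sb://").strip("/")
+    return {
+        "bootstrap_servers": f"{namespace}:9093",  # Event Hubs exposes Kafka on port 9093
+        "topic": parts["EntityPath"],              # the Event Hub / topic name
+        "sasl_username": "$ConnectionString",      # literal username
+        "sasl_password": connection_string,        # the whole connection string
+    }
+
+
+if __name__ == "__main__":
+    cs = "Endpoint=sb://example.servicebus.windows.net/;SharedAccessKeyName=send;SharedAccessKey=abc;EntityPath=noaa"
+    print(kafka_settings_from_connection_string(cs))
+```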
+### Environment Variables -## License +The tool also supports the following environment variables to avoid passing them via the command line: +- `CONNECTION_STRING`: Microsoft Event Hubs or Microsoft Fabric Event Stream connection string. +- `NOAA_LAST_POLLED_FILE`: File to store the last polled times for each station and product. -This project is licensed under the MIT License. See the LICENSE file for details. +## NOAA Products Supported -## Contact +The following NOAA products are supported by the tool: +- **Water Level**: `water_level` +- **Predictions**: `predictions` +- **Air Temperature**: `air_temperature` +- **Wind**: `wind` +- **Air Pressure**: `air_pressure` +- **Water Temperature**: `water_temperature` +- **Conductivity**: `conductivity` +- **Visibility**: `visibility` +- **Humidity**: `humidity` +- **Salinity**: `salinity` -For any questions or issues, please open an issue in the repository or contact the maintainer. +## Data Management +The tool polls NOAA data periodically and saves the last polled time in a file. This ensures that the tool only fetches new data in subsequent polling cycles. \ No newline at end of file diff --git a/noaa/pyproject.toml b/noaa/pyproject.toml index b6cd5a3..cacc95c 100644 --- a/noaa/pyproject.toml +++ b/noaa/pyproject.toml @@ -23,5 +23,5 @@ pytest-cov = ">=5.0.0" testcontainers = ">=4.8.1" requests-mock = ">=1.12.1" -[build-system.scripts] +[tool.poetry.scripts] noaa = "noaa:main" diff --git a/pegelonline/README.md b/pegelonline/README.md index 2a4fb77..24ef157 100644 --- a/pegelonline/README.md +++ b/pegelonline/README.md @@ -1,64 +1,114 @@ -## Usage +# PegelOnline Usage Guide -To use the `pegelonline` tool, you can run the following commands directly from the command line: +## Overview -### 1. List All Stations +**PegelOnline** is a tool designed to interact with the German WSV PegelOnline API to fetch water level data for rivers in Germany. The tool can retrieve water level data from individual stations, list available stations, or continuously poll the API to send water level updates to a Kafka topic. -To retrieve a list of all available stations providing water level data: +## Key Features: +- **Water Level Fetching**: Retrieve current water level data for specific stations from the PegelOnline API. +- **Station Listing**: List all available monitoring stations. +- **Kafka Integration**: Send water level updates as CloudEvents to a Kafka topic, supporting Microsoft Event Hubs and Microsoft Fabric Event Streams. + +## Installation + +The tool is written in Python and requires Python 3.10 or later. You can download Python from [here](https://www.python.org/downloads/) or get it from the Microsoft Store if you are on Windows. + +### Installation Steps + +Once Python is installed, you can install the tool from the command line as follows: ```bash -pegelonline list +pip install git+https://github.com/clemensv/real-time-sources#subdirectory=pegelonline ``` -This command will output a list of all stations, each with its unique identifier (`uuid`) and short name. +If you clone the repository, you can install the tool as follows: + +```bash +git clone https://github.com/clemensv/real-time-sources.git +cd real-time-sources/pegelonline +pip install . +``` + +For a packaged install, consider using the [CONTAINER.md](CONTAINER.md) instructions. + +## How to Use + +After installation, the tool can be run using the `pegelonline` command. It supports multiple subcommands: +- **List Stations (`list`)**: Fetch and display all available monitoring stations. 
+- **Get Water Level (`level`)**: Retrieve the current water level for a specific station.
+- **Feed Stations (`feed`)**: Continuously poll PegelOnline API for water levels and send updates to a Kafka topic.
 
-### 2. Fetch Water Level for a Specific Station
+### **List Stations (`list`)**
 
-To get the current water level for a specific station, use the station's short name:
+Fetches and displays all available monitoring stations from the PegelOnline API.
+
+#### Example Usage:
 
 ```bash
-pegelonline level
+pegelonline list
 ```
 
-Replace `` with the desired station's short name (e.g., `KOLN` for the Cologne station).
+### **Get Water Level (`level`)**
+
+Retrieves the current water level for the specified station.
 
-**Example:**
+- `shortname`: The short name of the station to query.
+
+#### Example Usage:
 
 ```bash
-pegelonline level KOLN
+pegelonline level <shortname>
 ```
 
-This will display the current water level measurement in a formatted JSON output.
+### **Feed Stations (`feed`)**
+
+Polls the PegelOnline API for water level measurements and sends them as
+CloudEvents to a Kafka topic. The events are formatted using CloudEvents
+structured JSON format and described in [EVENTS.md](EVENTS.md).
 
-### 3. Feed Water Level Updates to a Kafka Topic or Event Hub
+- `--kafka-bootstrap-servers`: Comma-separated list of Kafka bootstrap servers.
+- `--kafka-topic`: Kafka topic to send messages to.
+- `--sasl-username`: Username for SASL PLAIN authentication.
+- `--sasl-password`: Password for SASL PLAIN authentication.
+- `--connection-string`: Microsoft Event Hubs or Microsoft Fabric Event Stream [connection string](#connection-string-for-microsoft-event-hubs-or-fabric-event-streams) (overrides other Kafka parameters).
+- `--polling-interval`: Interval in seconds between API polling requests
+  (default is 60 seconds; for most stations the data is only updated once
+  every 6 minutes, though the interval differs from station to station).
 
-To continuously stream water level updates to a specified Kafka topic or Microsoft Event Hub:
+#### Example Usage:
 
 ```bash
-pegelonline feed --kafka-bootstrap-servers --kafka-topic --sasl-username --sasl-password --polling-interval
+pegelonline feed --kafka-bootstrap-servers "<bootstrap servers>" --kafka-topic "<topic>" --sasl-username "<username>" --sasl-password "<password>" --polling-interval 60
 ```
 
-Alternatively, you can use a connection string for Microsoft Event Hub:
+Alternatively, using a connection string for Microsoft Event Hubs or Microsoft Fabric Event Streams:
 
 ```bash
-pegelonline feed --connection-string --polling-interval
+pegelonline feed --connection-string "<connection string>" --polling-interval 60
 ```
 
-#### Options
+### Connection String for Microsoft Event Hubs or Fabric Event Streams
 
-- `--kafka-bootstrap-servers `: Comma-separated list of Kafka bootstrap servers.
-- `--kafka-topic `: Kafka topic to send messages to.
-- `--sasl-username `: Username for SASL PLAIN authentication.
-- `--sasl-password `: Password for SASL PLAIN authentication.
-- `--connection-string `: Microsoft Event Hubs or Microsoft Fabric Event Stream connection string.
-- `--polling-interval `: Polling interval in seconds (default is 60 seconds).
+The connection string format is as follows:
 
-#### Example
+```
+Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<policy-name>;SharedAccessKey=<access-key>;EntityPath=<event-hub-name>
+```
 
-To stream updates to a Kafka topic:
+When provided, the connection string is parsed to extract the Kafka configuration parameters:
+- **Bootstrap Servers**: Derived from the `Endpoint` value.
+- **Kafka Topic**: Derived from the `EntityPath` value.
+- **SASL Username and Password**: The username is set to `'$ConnectionString'`, and the password is the entire connection string.
 
-```bash
-pegelonline feed --kafka-bootstrap-servers "your.kafka.server:9092" --kafka-topic "your-kafka-topic" --sasl-username "your-username" --sasl-password "your-password" --polling-interval 30
-```
+### Environment Variables
 
+The tool supports the following environment variables to avoid passing them via the command line:
+- `KAFKA_BOOTSTRAP_SERVERS`: Kafka bootstrap servers (comma-separated list).
+- `KAFKA_TOPIC`: Kafka topic for publishing.
+- `SASL_USERNAME`: SASL username for Kafka authentication.
+- `SASL_PASSWORD`: SASL password for Kafka authentication.
+- `CONNECTION_STRING`: Microsoft Event Hubs or Microsoft Fabric Event Stream connection string.
+- `POLLING_INTERVAL`: Polling interval in seconds.
+
+## State Management
 
-This command will continuously fetch water level data from all stations and send updates to the specified Kafka topic or Event Hub at the defined polling interval.
\ No newline at end of file
+The tool handles state internally for efficient API polling and sending updates.
\ No newline at end of file
diff --git a/pegelonline/pegelonline/__init__.py b/pegelonline/pegelonline/__init__.py
index 1fe31ac..eee526f 100644
--- a/pegelonline/pegelonline/__init__.py
+++ b/pegelonline/pegelonline/__init__.py
@@ -1,5 +1,5 @@
 # __init.py__
-from . import pegelonline
+from .pegelonline import main
 
 if __name__ == "__main__":
-    pegelonline.main()
+    main()
diff --git a/pegelonline/pyproject.toml b/pegelonline/pyproject.toml
index 5952d8e..f38be35 100644
--- a/pegelonline/pyproject.toml
+++ b/pegelonline/pyproject.toml
@@ -24,5 +24,5 @@ pytest-cov = ">=5.0.0"
 testcontainers = ">=4.8.1"
 requests-mock = ">=1.12.1"
 
-[build-system.scripts]
+[tool.poetry.scripts]
 pegelonline = "pegelonline:main"
diff --git a/rss/README.md b/rss/README.md
index 6ddedbb..9f08d54 100644
--- a/rss/README.md
+++ b/rss/README.md
@@ -1,7 +1,160 @@
-# RSS Bridge
+# RSS Bridge Usage Guide
+
+## Overview
+
+**RSS Bridge** is a tool designed to fetch and process RSS/Atom feeds or OPML
+feed lists, with the ability to publish feed data to Microsoft Fabric Event
+Streams via Kafka. It can also handle the periodic polling of feeds, maintain a
+cache of processed feed states, and apply backoff and rate-limiting.
+
+## Installation
+
+The tool is written in Python and requires Python 3.10 or later. You can
+download Python from [here](https://www.python.org/downloads/) or get it from
+the Microsoft Store if you are on Windows. You may also need to install the
+`git` command line tool. You can download `git` from
+[here](https://git-scm.com/downloads).
+
+### Installation Steps
+
+Once Python is installed, you can install the tool from the command line as follows:
+
+```bash
+pip install git+https://github.com/clemensv/real-time-sources#subdirectory=rss
+```
+
+If you clone the repository, you can install the tool from the command line as follows:
+
+```bash
+git clone https://github.com/clemensv/real-time-sources.git
+cd real-time-sources/rss
+pip install .
+```
+
+For a packaged install, consider using the [CONTAINER.md](CONTAINER.md) instructions.
+
+## Key Features:
+- **Feed Processing**: Retrieve and parse RSS/Atom feeds and OPML files, and
+  send the data to a Kafka topic.
+- **State Management**: Track feed states to handle ETag caching, backoff, and
+  skip rules.
+- **Feed Store**: Add, remove, and show stored feeds from an OPML-style feed
+  store.
+- **Kafka Integration**: Send processed feed data to a Kafka topic with + Microsoft Fabric Event Streams support. + +## How to Use + +The tool supports multiple subcommands: +- **Process Feeds (`process`)**: Fetches and processes feed items from URLs or OPML files. +- **Add Feeds (`add`)**: Adds new feed URLs or OPML files to the feed store. +- **Remove Feeds (`remove`)**: Removes specific feed URLs from the feed store. +- **Show Feeds (`show`)**: Displays all stored feed URLs. + +The argument `--state-dir` is optional and common for all subcommands. It +specifies the directory for storing state files and the feed store. If not +provided, the default directory is the user's home directory. + +### **Process Feeds (`process`)** + +Fetches and processes feed items from URLs or OPML files and publishes them to a +Kafka topic as CloudEvents. The event format is described in +[EVENTS.md](EVENTS.md). + + + - `--kafka-bootstrap-servers`: Kafka bootstrap servers (comma-separated list). + - `--kafka-topic`: Kafka topic for publishing. + - `--sasl-username`: SASL username for Kafka authentication. + - `--sasl-password`: SASL password for Kafka authentication. + - `--connection-string`: Microsoft Event Hubs or Microsoft Fabric Event Streams [connection string](#connection-string-for-microsoft-event-hubs-or-fabric-event-streams) (obviates the need for and overrides other Kafka parameters). + - `--state-dir`: Directory for storing state files. + - `feed_urls`: List of RSS/Atom feed URLs or OPML URLs. These are optional if there are feeds registered in the feed store. + + + +#### Example Usage for Kafka + ```bash + rssbridge process --kafka-bootstrap-servers "" --kafka-topic "" --sasl-username "" --sasl-password "" + ``` + +#### Example Usage for Microsoft Event Hubs or Fabric Event Streams + ```bash + rssbridge process --connection-string "" + ``` + +### Add Feeds (`add`) + +Adds new feed URLs or OPML files to the feed store. + +#### Example Usage + + ```bash + rssbridge add + ``` + +### Remove Feeds (`remove`) + +Removes specific feed URLs from the feed store. + +#### Example Usage + +```bash +rssbridge remove +``` + +### Show Feeds (`show`) + +Displays all stored feed URLs. + +#### Example Usage + +```bash +rssbridge show +``` + +### Connection String for Microsoft Event Hubs or Fabric Event Streams + +Instead of manually passing the Kafka connection parameters (`bootstrap-servers`, `topic`, `username`, and `password`), you can use a **connection string** for **Microsoft Event Hubs** or **Microsoft Fabric Event Streams**. This connection string simplifies the configuration by consolidating these parameters. + +#### Format +The connection string should be provided in the following format: +```bash +Endpoint=sb://.servicebus.windows.net/;SharedAccessKeyName=;SharedAccessKey=;EntityPath= +``` + +#### Usage in the Command Line +You can provide the connection string using the `--connection-string` flag: + +```bash +rssbridge process --connection-string "" +``` + +The tool automatically parses the connection string to extract the following +details: +- **Bootstrap Servers**: Derived from the `Endpoint` value. +- **Kafka Topic**: Derived from the `EntityPath` value. +- **SASL Username and Password**: The username is set to `'$ConnectionString'` + and the password is set to the entire connection string. + +This simplifies the Kafka configuration for Microsoft Event Hubs and Fabric +Event Streams. 
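+
+For illustration, this is roughly what publishing a structured-mode CloudEvent
+over the Kafka protocol looks like with the `confluent-kafka` client. It is a
+minimal sketch, not the bridge's actual code: the topic, event `type`, and
+payload below are made-up placeholders, and the real event types and payload
+schemas are documented in [EVENTS.md](EVENTS.md).
+
+```python
+# pip install confluent-kafka
+import json
+import uuid
+from datetime import datetime, timezone
+
+from confluent_kafka import Producer
+
+connection_string = "<your connection string>"                  # placeholder
+bootstrap_servers = "<namespace>.servicebus.windows.net:9093"   # derived from Endpoint
+topic = "<event hub / topic name>"                              # derived from EntityPath
+
+producer = Producer({
+    "bootstrap.servers": bootstrap_servers,
+    "security.protocol": "SASL_SSL",
+    "sasl.mechanism": "PLAIN",
+    "sasl.username": "$ConnectionString",
+    "sasl.password": connection_string,
+})
+
+# A structured-mode CloudEvent: the envelope and payload travel together as JSON.
+event = {
+    "specversion": "1.0",
+    "id": str(uuid.uuid4()),
+    "type": "example.feeditem",          # hypothetical; see EVENTS.md for the real types
+    "source": "https://example.org/feed.xml",
+    "time": datetime.now(timezone.utc).isoformat(),
+    "datacontenttype": "application/json",
+    "data": {"title": "Hello", "link": "https://example.org/post/1"},
+}
+
+producer.produce(
+    topic,
+    key=event["source"],
+    value=json.dumps(event).encode("utf-8"),
+    headers=[("content-type", b"application/cloudevents+json")],
+)
+producer.flush(10)  # wait up to 10 seconds for delivery
+```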
+ +### Environment Variables +The tool supports the following environment variables to avoid passing them via +the command line: +- `KAFKA_BOOTSTRAP_SERVERS`: Kafka bootstrap servers (comma-separated list). +- `KAFKA_TOPIC`: Kafka topic for publishing. +- `SASL_USERNAME`: SASL username for Kafka authentication. +- `SASL_PASSWORD`: SASL password for Kafka authentication. +- `FEED_URLS`: Comma-separated list of RSS/Atom feed URLs. +- `STATE_DIR`: Directory for storing state files. +- `CONNECTION_STRING`: Microsoft Event Hubs or Microsoft Fabric Event Stream + connection string. + +### State Management + +The tool maintains a state file (`~/.rss-grabber.json`) that stores the ETag and +the time of the next polling operation for each feed URL. The feed store is +stored in OPML format (`~/.rss-grabber-feedstore.xml`). + -The [RSS feed poller](rss/README.md) is a command line tool that can be used to -retrieve real-time news and blog posts from any RSS feed. The tool can be -configured with a list of RSS feed URLs or OPML files, and it will poll the -feeds at a configurable interval. The RSS client will only forward new items -from the feeds. diff --git a/rss/pyproject.toml b/rss/pyproject.toml index 17ffdb0..dc743bd 100644 --- a/rss/pyproject.toml +++ b/rss/pyproject.toml @@ -28,5 +28,5 @@ pytest-cov = ">=5.0.0" testcontainers = ">=4.8.1" requests-mock = ">=1.12.1" -[build-system.scripts] +[tool.poetry.scripts] rssbridge = "rssbridge:main" diff --git a/rss/rssbridge/__init__.py b/rss/rssbridge/__init__.py index 2d5b41f..14cedf3 100644 --- a/rss/rssbridge/__init__.py +++ b/rss/rssbridge/__init__.py @@ -1,5 +1,5 @@ # __init.py__ -from . import rssbridge +from .rssbridge import main if __name__ == "__main__": - rssbridge.main() + main() diff --git a/rss/rssbridge/rssbridge.py b/rss/rssbridge/rssbridge.py index aa40f82..11d027e 100644 --- a/rss/rssbridge/rssbridge.py +++ b/rss/rssbridge/rssbridge.py @@ -518,7 +518,7 @@ def parse_connection_string(connection_string: str) -> Dict[str, str]: return config_dict -async def main(): +async def run(): """ Main function to handle argparse commands. """ @@ -623,5 +623,8 @@ async def main(): else: parser.print_help() +def main(): + asyncio.run(run()) + if __name__ == "__main__": - asyncio.run(main()) + main() diff --git a/tools/generate-events-md.ps1 b/tools/generate-events-md.ps1 index bb4901b..984cf3d 100644 --- a/tools/generate-events-md.ps1 +++ b/tools/generate-events-md.ps1 @@ -1,4 +1,23 @@ -# switch into this directory with pushd +<# +.SYNOPSIS + This script generates event markdown files. + +.DESCRIPTION + The script is designed to automate the generation of markdown files for events. + It should be executed from the specified directory to ensure all dependencies and resources are correctly referenced. + +.PARAMETER None + No parameters are required for this script. + +.EXAMPLE + To run this script, navigate to the directory using pushd and execute the script. + +.NOTES + Author: [Your Name] + Date: [Date] + FilePath: /c:/git/real-time-sources/tools/generate-events-md.ps1 +#> + pushd $PSScriptRoot python .\printdoc.py ..\gtfs\xreg\gtfs.xreg.json --title "GTFS API Bridge Events" --description "This document describes the events that are emitted by the GTFS API Bridge." > ..\gtfs\EVENTS.md diff --git a/tools/install-avrotize.ps1 b/tools/install-avrotize.ps1 index 5a15b26..54afc6c 100644 --- a/tools/install-avrotize.ps1 +++ b/tools/install-avrotize.ps1 @@ -1,3 +1,23 @@ +<# +.SYNOPSIS + Installs Avrotize, a code generator for schematized data. 
+ +.DESCRIPTION + This script automates the installation of Avrotize, which is utilized by various scripts within this repository. + Avrotize helps in generating code based on defined schemas, facilitating the management and manipulation of structured data. + +.PARAMETER None + This script does not take any parameters. + +.EXAMPLE + Run the script to install Avrotize for use in your projects. + +.NOTES + Author: [Your Name] + Date: [Date] + Version: 1.0 +#> + # Get the user's profile directory $userProfile = [System.Environment]::GetFolderPath('UserProfile') diff --git a/tools/printdoc.py b/tools/printdoc.py index becb3dc..4dea9b2 100644 --- a/tools/printdoc.py +++ b/tools/printdoc.py @@ -1,3 +1,24 @@ +""" +This script generates documentation from an xRegistry JSON manifest file containing message groups and schema groups. + +Usage: + python printdoc.py [--title ] [--description <description>] +Arguments: + manifest_file: Path to the JSON manifest file. + --title: Title of the documentation (replaces "Table of Contents"). Default is "Table of Contents". + --description: Description added under the title. Default is an empty string. +Functions: + main(): Parses command line arguments, reads the JSON manifest file, and generates documentation. + generate_documentation(data, title, description): Generates the documentation content based on the provided data. + process_message(msg, schemagroups): Processes individual messages and their metadata, generating documentation for attributes and schemas. + resolve_schema(schemaurl, schemagroups): Resolves the schema URL to retrieve the corresponding schema from the schema groups. + generate_anchor(name): Generates a markdown-compatible anchor from a given name. + print_schema(schema): Prints the schema documentation, including nested records and enums. + print_record(schema, records_to_document, enums_to_document, documented_records): Prints the documentation for a record schema. + get_field_type_str(field_type, records_to_document, enums_to_document): Returns a string representation of a field type, handling records and enums. + print_enum(enum_schema, documented_enums): Prints the documentation for an enum schema. +""" + import json import argparse diff --git a/tools/run-kql-script.ps1 b/tools/run-kql-script.ps1 index 231e42c..268493c 100644 --- a/tools/run-kql-script.ps1 +++ b/tools/run-kql-script.ps1 @@ -1,3 +1,28 @@ +<# +.SYNOPSIS + Executes a Kusto Query Language (KQL) script against a specified Azure Data Explorer cluster and database. + +.DESCRIPTION + This script checks for the presence of the Kusto CLI tool. If it is not found, it calls an installation script. + Once the Kusto CLI is available, it executes the provided KQL script against the specified cluster and database. + +.PARAMETER clusterUri + The URI of the Azure Data Explorer cluster to connect to. This parameter is mandatory. + +.PARAMETER database + The name of the database within the Azure Data Explorer cluster where the KQL script will be executed. This parameter is mandatory. + +.PARAMETER script + The path to the KQL script file that will be executed. This parameter is mandatory. + +.EXAMPLE + .\run-kql-script.ps1 -clusterUri "https://mycluster.kusto.windows.net" -database "mydatabase" -script "C:\path\to\script.kql" + +.NOTES + Author: [Your Name] + Date: [Date] +#> + param( [Parameter(Mandatory=$true)] [string]$clusterUri,