diff --git a/README.md b/README.md
index c969b75..ac9f264 100644
--- a/README.md
+++ b/README.md
@@ -8,32 +8,23 @@

 Open in GitHub Codespaces
-
-# codecollection-template
-A hello-world-style template for codecollection authors to get started writing codebundles. This template contains the minimum file structure expected by the RunWhen platform.
-
 [![Build](https://github.com/runwhen-contrib/codecollection-template/actions/workflows/build.yaml/badge.svg)](https://github.com/runwhen-contrib/codecollection-template/actions/workflows/build.yaml)
-## Getting Started
-Looking to be a contributor for CodeCollections or start your own? We'd love to collaborate! Head on over to our [public docs](https://docs.runwhen.com/public/runwhen-authors/getting-started-with-codecollection-development) to get started.
-File Structure overview of devcontainer:
-```
--/app/
-  |- auth/ #store secrets here, it should already be properly gitignored for you
-  |- codecollection/
-  |  |- codebundles/ # stores codebundles that can be run
-  |  |- libraries/ # stores python keyword libraries used by codebundles
-  |- dev_facade/ # provides interfaces equivalent to those used on the platform, but just dry runs the keywords to assist with development
-  ...
-```
+[Upstream Docs - CodeCollection Template](https://github.com/runwhen-contrib/codecollection-template/blob/main/README.md)
-The included script `ro` wraps the `robot` RobotFramework binary, and includes some extra functionality to write logs to a consistent location for viewing in a HTTP server at http://localhost:3000/ that is always running as part of the devcontainer.
+# InfraCloud RunWhen CodeCollection
-### Quickstart
+This CodeCollection aims to create a repository of CodeBundles that address the various reproducible incident scenarios at [Infracloud/sre-stack](https://github.com/infracloudio/sre-stack/)
-Navigate to the codebundle directory
-`cd codecollection/codebundles/hello_world/`
+- Set meaningful SLOs on Services and their dependencies
+  - DBs
+  - Queues
+  - Caches
+  - Gateways and proxies
+- Create SLIs to continuously monitor the health of services and dependencies
+- Create mitigation runbooks for scenarios where the root cause can be deterministically identified
-Run the codebundle
-`ro sli.robot`
+## Additional Docs
+- [RunWhen Concepts](docs/runwhen/concepts.md)
+- [Contributing to CodeCollections/CodeBundles](docs/runwhen/contrib.md)
\ No newline at end of file
diff --git a/codebundles/rds-mysql-conn-count/README.md b/codebundles/rds-mysql-conn-count/README.md
index e69de29..70a57ba 100644
--- a/codebundles/rds-mysql-conn-count/README.md
+++ b/codebundles/rds-mysql-conn-count/README.md
@@ -0,0 +1,94 @@
+# CodeBundle - RDS MySQL Connection Count
+
+This codebundle detects and resolves incidents caused by too many sleeping connections in MySQL.
+
+- Target Service - MySQL
+- Cloud Platform - AWS/RDS
+
+## SLX
+```YAML
+statement: RDS MySQL connections should be within 80% of total max connections.
+alias: RDS MySQL Connections Count
+metricType: gauge
+asMeasuredBy: Score based on Prometheus query
+icon: Cloud
+owners:
+  - saurabh.yadav@infracloud.io
+imageURL: >-
+  https://storage.googleapis.com/runwhen-nonprod-shared-images/icons/kubernetes/resources/labeled/ns.svg
+
+```
+## SLO / Service Level Objective
+Example:
+```YAML
+codeBundle:
+  repoUrl: https://github.com/infracloudio/ifc-rw-codecollection
+  pathToYaml: codebundles/slo-default/queries.yaml
+  ref: main
+sloSpecType: simple-mwmb
+objective: 95
+threshold: 48
+operand: lt
+```
+
+## SLI / Service Level Indicator
+```YAML
+displayUnitsLong: OK
+displayUnitsShort: ok
+locations:
+  - location-01-us-west1
+description: >-
+  Watch RDS MySQL connection count
+codeBundle:
+  repoUrl: https://github.com/infracloudio/ifc-rw-codecollection
+  ref: main
+  pathToRobot: codebundles/rds-mysql-conn-count/sli.robot
+# read more about intervalStrategy here: https://docs.runwhen.com/public/runwhen-platform/feature-overview/points-on-the-map-slxs/service-level-indicators-slis/interval-strategies
+intervalStrategy: intermezzo
+intervalSeconds: 30
+configProvided:
+  # Change PROMETHEUS_HOSTNAME to your endpoint; currently the endpoint needs to be publicly exposed.
+  - name: PROMETHEUS_HOSTNAME
+    value: >-
+      http://aeccfb7ff9bfb4705b6218294a7346c3-2081802229.us-west-2.elb.amazonaws.com/prometheus/api/v1
+  - name: QUERY
+    value: >-
+      aws_rds_database_connections_average{dimension_DBInstanceIdentifier="robotshopmysql"} > 1
+  - name: TRANSFORM
+    value: RAW
+  - name: STEP
+    value: '30'
+  - name: DATA_COLUMN
+    value: '1'
+  - name: NO_RESULT_OVERWRITE
+    value: 'Yes'
+  - name: NO_RESULT_VALUE
+    value: '0'
+servicesProvided:
+  - name: curl
+    locationServiceName: curl-service.shared
+```
+
+## RunBook / Mitigation
+
+```YAML
+location: location-01-us-west1
+codeBundle:
+  repoUrl: https://github.com/infracloudio/ifc-rw-codecollection
+  ref: main
+  pathToRobot: codebundles/rds-mysql-conn-count/runbook.robot
+servicesProvided:
+  - name: curl
+    locationServiceName: curl-service.shared
+configProvided:
+  - name: MYSQL_USER
+    value: admin
+  - name: MYSQL_HOST
+    value: robotshopmysql.example.us-west-2.rds.amazonaws.com
+  - name: PROCESS_USER
+    value: shipping
+```
+
+### Assumptions & Pitfalls
+
+These configs are placeholder YAML. Modify them to fit your environment, then paste them into the platform.
\ No newline at end of file
diff --git a/docs/runwhen/concepts.md b/docs/runwhen/concepts.md
new file mode 100644
index 0000000..f8f1fe7
--- /dev/null
+++ b/docs/runwhen/concepts.md
@@ -0,0 +1,101 @@
+# RunWhen Concepts
+- [RunWhen Concepts](#runwhen-concepts)
+- [RunWhen Local](#runwhen-local)
+  - [CheatSheet Generator](#cheatsheet-generator)
+  - [Uploading Cluster Topology to the Platform](#uploading-cluster-topology-to-the-platform)
+- [CodeCollections](#codecollections)
+- [CodeBundles](#codebundles)
+
+# RunWhen Local
+- [source-code](https://github.com/runwhen-contrib/runwhen-local)
+- [Helm Chart](https://github.com/runwhen-contrib/helm-charts/tree/main/charts/runwhen-local)
+- [Upstream docs](https://docs.runwhen.com/public/v/runwhen-local/)
+
+RunWhen Local has two core functions:
+- Generate remediation scripts / CheatSheets from included templates for your local cluster
+- Upload Cluster Topology to the RunWhen Platform
+
+## CheatSheet Generator
+At the moment RunWhen Local **does not possess the ability to discover issues** in
+your cluster and suggest mitigation runbooks / codebundles.
+
+**However, it discovers your Kubernetes resources and object names.**
+Using these, it generates a wide set of runbooks for you, which helps if you already know the
+root cause. These runbooks contain documentation and pastable shell script
+snippets for the searched issue. These scripts / cheatsheets are pre-templated
+with your namespaces and Kubernetes resource names.
+
+This collection of cheatsheets / runbooks, although not exhaustive, covers a significant portion
+of recurring issues and healthcheck failures and can be useful to SREs for quick
+resolution of incidents.
+
+[Upstream Examples](https://docs.runwhen.com/public/v/runwhen-local/user-guide/features/user_guide-feature_overview)
+
+## Uploading Cluster Topology to the Platform
+The second core function of RunWhen Local is to upload cluster topology to the
+RunWhen Platform so you can visualize the cluster workload map from a configured
+RunWhen workspace.
+
+- First, follow the documentation at [Upload to RunWhen Platform](https://docs.runwhen.com/public/v/runwhen-local/user-guide/features/upload-to-runwhen-platform#upload-from-the-cli)
+  - This generates the `uploadInfo.yaml` file
+- Next, take the YAML object and copy its contents to the `uploadInfo:[]` section
+of the helm [`values.yaml` file](https://github.com/runwhen-contrib/helm-charts/blob/main/charts/runwhen-local/values.yaml#L121)
+- Once configured it should look like this:
+  ```YAML
+  uploadInfo:
+    workspaceName:
+    token: # Do NOT add token and commit to git
+    workspaceOwnerEmail: tester@my-company.com
+    papiURL: https://papi.beta.runwhen.com
+    defaultLocation: location-01-us-west1 # available runwhen locations
+  ```
+- Pass the token from the Helm CLI so that it is not leaked via git
+  ```bash
+  helm upgrade --install "${HELM_RELEASE_NAME}" runwhen-contrib/runwhen-local \
+    --set uploadInfo.token="${RUNWHEN_PLATFORM_TOKEN}" \
+    -f "${VALUES_FILE}" -n "${NAMESPACE}"
+  ```
+
+# CodeCollections
+CodeCollections are a group of CodeBundles that can be referenced and used in RunWhen Platform.
+
+*N.B. Currently, codecollections cannot be imported explicitly and run against your local cluster using RunWhen Local.*
+
+Currently RunWhen has published two codecollections:
+- [runwhen-public-codecollection](https://github.com/runwhen-contrib/rw-public-codecollection)
+  - These contain codebundles that are usually run against services and do not involve a Shell / CLI component
+- [runwhen-cli-codecollection](https://github.com/runwhen-contrib/rw-cli-codecollection)
+  - These are generally targeted towards SRE workloads and wrap various shell scripts and CLI tooling.
+
+# CodeBundles
+CodeBundles are specific detectors/mitigators of known SLI/SLO violations in a live software stack.
+
+A CodeBundle comprises:
+- Robot files
+  - Scripts / Playbooks / tasksets written using [Robot Framework](), that either
+    - Create and enforce RunWhen SLIs - `sli.robot`
+    - Create mitigation runbooks in response to an SLO/SLI violation - `runbook.robot`
+- Platform definitions of `{SLX, SLO, SLI, Runbook}` as `YAML` configurations
+  - These do not need to live in your repo; however, it is good practice to commit them to git.
+  - These configurations wrap standard behaviors for interacting with the RunWhen Platform API, `papi`
+    - Endpoint: `https://papi.beta.runwhen.com`
+  - The RunWhen `YAML` configurations are only pertinent when your codebundle is live on the RunWhen Platform; as of now they play no role in either local testing or RunWhen Local.
+- Test resources / scripts
+
+In a local testing environment you only need to execute the `*.robot` files inside the provided container configurations:
+- [Dockerfile](../../Dockerfile)
+- [vscode/devcontainer](../../.devcontainer.json)
+
+
+The usual call chain is as follows:
+- Robot Scripts
+  - User variable and secret injection
+  - RunWhen Libraries
+    - RunWhen Services
+      - Wrapped shell CLI command / Platform SDK code execution
+    - or, direct shims to your shell scripts / python code when services are unavailable
+  - These tasks fetch the current value of a metric / state
+  - This metric value is then compared against the thresholds defined at `sli/slo.yaml` in the platform.
+  - If the Robot script just runs a set of tasks as a mitigation step, it returns either success or failure.
+
+More concepts and non-trivial FAQs around writing CodeBundles are explained at [Contributing to CodeCollections/CodeBundles](contrib.md)
\ No newline at end of file
diff --git a/docs/runwhen/contrib.md b/docs/runwhen/contrib.md
new file mode 100644
index 0000000..f79a7c7
--- /dev/null
+++ b/docs/runwhen/contrib.md
@@ -0,0 +1,56 @@
+# Contributing to CodeCollections/CodeBundles
+
+## Creating a New CodeCollection
+### Forking the template repository
+
+## Writing a Non-trivial CodeBundle
+### Directory structure / Scaffolding
+
+
+
+#########
+Repository Setup
+Introduction to Robot Framework Scripts (how it interacts with RunWhen)
+Calling bash with relative paths
+Secret handling
+Suite Initialization
+Library usage
+Explain the call chain
+Library Setup
+How to get an exhaustive list of available libraries
+CLI repo
+Public repo
+Explain what libraries would be auto-fetched by devcontainer tooling
+Core
+CLI
+What needs to be added for specific libraries that are used in a robot script
+Paths
+Running a test with local docker
+Adding additional binaries to devcontainer as needed
+Mysql-client
+Postgres-client
+Redis-client
+Configuring Env / secrets
+Expose endpoints
+Local docker network
+Expose from test cluster
+Test by using docker run on localhost
+Test in your live environment
+Deploy as a k8s job
+Give an example
+Testing on RunWhen Platform
+Connecting test env/cluster to RunWhen
+Runwhen-local upload
+If a Robot script needs additional dependencies, such as CLI tools, the devs need to be informed; for now they will handle the update on the platform side
+Mysql-client
+Postgres-client
+Redis-client
+Registering your first codecollection to the RunWhen Platform
+Mention that this may be in private as per developer discretion
+How to configure the YAML to test
+Branch name length limitations
+Expose metric endpoints so that they are accessible to runwhen-platform codebundles
+Configuring Env / secrets
+Running the test
+Checking logs
+Checking for errors
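
The call chain in `docs/runwhen/concepts.md` (fetch a metric value, then compare it against the SLO threshold) and the SLI settings in `codebundles/rds-mysql-conn-count/README.md` can be read roughly as the sketch below. This is an illustration under stated assumptions, not the platform's implementation: it assumes the Prometheus endpoint returns the standard instant-query JSON shape, that `DATA_COLUMN: '1'` selects the sample value from the `[timestamp, value]` pair, and that `NO_RESULT_OVERWRITE` / `NO_RESULT_VALUE` substitute a default when the query matches nothing. The names `extract_sli_value` and `within_threshold` and both sample payloads are hypothetical.

```python
from typing import Any


def extract_sli_value(response: dict[str, Any], data_column: int = 1,
                      no_result_value: float = 0.0) -> float:
    """Pull one sample out of a Prometheus instant-query response.

    Falls back to no_result_value when the query matched no series
    (assumed meaning of NO_RESULT_OVERWRITE / NO_RESULT_VALUE).
    """
    results = response.get("data", {}).get("result", [])
    if not results:
        return no_result_value
    # Each instant-vector sample is [unix_timestamp, "value-as-string"].
    return float(results[0]["value"][data_column])


def within_threshold(value: float, threshold: float, operand: str = "lt") -> bool:
    """Compare a value to the SLO threshold; only lt/gt are sketched here."""
    if operand == "lt":
        return value < threshold
    if operand == "gt":
        return value > threshold
    raise ValueError(f"unsupported operand: {operand}")


# Made-up payloads for illustration only.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"dimension_DBInstanceIdentifier": "robotshopmysql"},
                "value": [1700000000, "12"],
            }
        ],
    },
}
empty = {"status": "success", "data": {"resultType": "vector", "result": []}}

# threshold=48 and operand="lt" mirror the SLO example in the README.
print(within_threshold(extract_sli_value(sample), threshold=48, operand="lt"))
print(extract_sli_value(empty))
```

Keeping the parsing and the comparison as separate pure functions makes both easy to exercise locally without a reachable Prometheus endpoint.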