diff --git a/README.md b/README.md index c969b75..ac9f264 100644 --- a/README.md +++ b/README.md @@ -8,32 +8,23 @@
- -# codecollection-template -A hello-world-style template for codecollection authors to get started writing codebundles. This template contains the minimum file structure expected by the RunWhen platform. - [![Build](https://github.com/runwhen-contrib/codecollection-template/actions/workflows/build.yaml/badge.svg)](https://github.com/runwhen-contrib/codecollection-template/actions/workflows/build.yaml) -## Getting Started -Looking to be a contributor for CodeCollections or start your own? We'd love to collaborate! Head on over to our [public docs](https://docs.runwhen.com/public/runwhen-authors/getting-started-with-codecollection-development) to get started. -File Structure overview of devcontainer: -``` --/app/ - |- auth/ #store secrets here, it should already be properly gitignored for you - |- codecollection/ - | |- codebundles/ # stores codebundles that can be run - | |- libraries/ # stores python keyword libraries used by codebundles - |- dev_facade/ # provides interfaces equivalent to those used on the platform, but just dry runs the keywords to assist with development - ... -``` +[Upstream Docs - CodeCollection Template](https://github.com/runwhen-contrib/codecollection-template/blob/main/README.md) -The included script `ro` wraps the `robot` RobotFramework binary, and includes some extra functionality to write logs to a consistent location for viewing in a HTTP server at http://localhost:3000/ that is always running as part of the devcontainer. 
+# InfraCloud RunWhen CodeCollection -### Quickstart +This CodeCollection aims to create a repository of CodeBundles that can address the various reproducible incident scenarios at [Infracloud/sre-stack](https://github.com/infracloudio/sre-stack/) -Navigate to the codebundle directory -`cd codecollection/codebundles/hello_world/` +- Set meaningful SLOs on Services and their dependencies + - DBs + - Queues + - Caches + - Gateways and proxies +- Create SLIs to continuously monitor the health of services and dependencies +- Create mitigation runbooks for scenarios where the root cause can be deterministically identified -Run the codebundle -`ro sli.robot` +## Additional Docs +- [RunWhen Concepts](docs/runwhen/concepts.md) +- [Contributing to CodeCollections/CodeBundles](docs/runwhen/contrib.md) \ No newline at end of file diff --git a/codebundles/rds-mysql-conn-count/README.md b/codebundles/rds-mysql-conn-count/README.md index e69de29..70a57ba 100644 --- a/codebundles/rds-mysql-conn-count/README.md +++ b/codebundles/rds-mysql-conn-count/README.md @@ -0,0 +1,94 @@ +# CodeBundle - RDS MySQL Connection Count + +This codebundle aims to detect and resolve incidents caused by too many sleeping connections in MySQL. + +- Target Service - MySQL +- Cloud Platform - AWS/RDS + +## SLX +```YAML +statement: RDS MySql connections should be within 80% of total max connections. 
+alias: RDS MySql Connections Count +metricType: gauge +asMeasuredBy: Score based on Prometheus query +icon: Cloud +owners: + - saurabh.yadav@infracloud.io +imageURL: >- + https://storage.googleapis.com/runwhen-nonprod-shared-images/icons/kubernetes/resources/labeled/ns.svg + +``` +## SLO / Service Level Objective +Example: +```YAML +codeBundle: + repoUrl: https://github.com/infracloudio/ifc-rw-codecollection + pathToYaml: codebundles/slo-default/queries.yaml + ref: main +sloSpecType: simple-mwmb +objective: 95 +threshold: 48 +operand: lt +``` + +## SLI / Service Level Indicator +```YAML +displayUnitsLong: OK +displayUnitsShort: ok +locations: + - location-01-us-west1 +description: >- + Watch RDS MySql connection count +codeBundle: + repoUrl: https://github.com/infracloudio/ifc-rw-codecollection + ref: main + pathToRobot: codebundles/rds-mysql-conn-count/sli.robot +# read more about intervalStrategy here: https://docs.runwhen.com/public/runwhen-platform/feature-overview/points-on-the-map-slxs/service-level-indicators-slis/interval-strategies +intervalStrategy: intermezzo +intervalSeconds: 30 +configProvided: + # Change PROMETHEUS_HOSTNAME to your endpoint; the endpoint currently needs to be publicly exposed. 
+ - name: PROMETHEUS_HOSTNAME + value: >- + http://aeccfb7ff9bfb4705b6218294a7346c3-2081802229.us-west-2.elb.amazonaws.com/prometheus/api/v1 + - name: QUERY + value: >- + aws_rds_database_connections_average{dimension_DBInstanceIdentifier="robotshopmysql"} > 1 + - name: TRANSFORM + value: RAW + - name: STEP + value: '30' + - name: DATA_COLUMN + value: '1' + - name: NO_RESULT_OVERWRITE + value: 'Yes' + - name: NO_RESULT_VALUE + value: '0' +servicesProvided: + - name: curl + locationServiceName: curl-service.shared +``` + +## RunBook / Mitigation + +```YAML +location: location-01-us-west1 +codeBundle: + repoUrl: https://github.com/infracloudio/ifc-rw-codecollection + ref: main + pathToRobot: codebundles/rds-mysql-conn-count/runbook.robot +servicesProvided: + - name: curl + locationServiceName: curl-service.shared +configProvided: + - name: MYSQL_USER + value: admin + - name: MYSQL_HOST + value: robotshopmysql.example.us-west-2.rds.amazonaws.com + - name: PROCESS_USER + value: shipping +``` + +### Assumptions & Pitfalls + +These configs are placeholder YAML. Modify them to suit your environment, then paste them into the platform. 
\ No newline at end of file diff --git a/docs/runwhen/concepts.md b/docs/runwhen/concepts.md new file mode 100644 index 0000000..f8f1fe7 --- /dev/null +++ b/docs/runwhen/concepts.md @@ -0,0 +1,101 @@ +# RunWhen Concepts +- [RunWhen Concepts](#runwhen-concepts) +- [Runwhen Local](#runwhen-local) + - [CheatSheet Generator](#cheatsheet-generator) + - [Uploading Cluster Topology to the Platform](#uploading-cluster-topology-to-the-platform) +- [CodeCollections](#codecollections) +- [CodeBundles](#codebundles) + +# Runwhen Local +- [source-code](https://github.com/runwhen-contrib/runwhen-local) +- [Helm Chart](https://github.com/runwhen-contrib/helm-charts/tree/main/charts/runwhen-local) +- [Upstream docs](https://docs.runwhen.com/public/v/runwhen-local/) + +RunWhen Local has two core functions: +- Generate remediation scripts / CheatSheets from included templates for your local cluster +- Upload Cluster Topology to the RunWhen Platform + +## CheatSheet Generator +At the moment RunWhen Local **does not possess the ability to discover issues** in +your cluster and suggest mitigation runbooks / codebundles. + +**However, it discovers your Kubernetes resources and object names.** +Using these, it generates a wide set of runbooks that are useful if you already know the +root cause. These runbooks contain documentation and pastable shell script +snippets for the searched issue. These scripts / cheatsheets are pre-templated +with your namespaces and Kubernetes resource names. + +This collection of cheatsheets / runbooks, although not exhaustive, covers a significant portion +of recurring issues and healthcheck failures and can be useful to SREs for quick +resolution of incidents. 
+ +[Upstream Examples](https://docs.runwhen.com/public/v/runwhen-local/user-guide/features/user_guide-feature_overview) + +## Uploading Cluster Topology to the Platform +The second core function of runwhen-local is to upload cluster topology to the +RunWhen platform so you can visualize the cluster workload map from a configured +RunWhen workspace. + +- First, follow the documentation at [Upload to RunWhen Platform](https://docs.runwhen.com/public/v/runwhen-local/user-guide/features/upload-to-runwhen-platform#upload-from-the-cli) + - To generate the `uploadInfo.yaml` file +- Next, take the yaml object and copy its contents over to the `uploadInfo:[]` section +of the helm [`values.yaml` file](https://github.com/runwhen-contrib/helm-charts/blob/main/charts/runwhen-local/values.yaml#L121) +- Once configured it should look like this: + ```YAML + uploadInfo: + workspaceName: