Releases: litmuschaos/litmus

1.13.0

15 Feb 19:23
dc086b3

New Features & Enhancements

  • Moves the Litmus Portal to beta-2 phase with the following improvements:

    • Ability to disable workflow schedules
    • Support for configuration of private Git repositories as a source for experiments & predefined workflows (private MyHub)
    • Allows the full set of CRUD operations on the embedded ChaosHub/MyHub
    • Improves the chaos visualization via horizontal/vertical workflow views and proper formatting of logs for the workflow nodes.
  • Enhances the ChaosExperiment CRD to take HostPath Volume Type input.

  • Removes the limitation that only a single workload (amongst those sharing the labels) can be annotated for chaos.

  • Enhances the httpProbe to perform POST operations with payload described in the ChaosEngine or via a file mounted as a configmap.

  • Simplifies node resource chaos experiments to accept resources in units (mebibytes) along with relative percentage inputs.

  • Makes the termination mode configurable for the container-kill experiment (defaults to SIGKILL)

  • Adds more details to experiment logs around annotated workloads & filtered pod targets

  • Improves the disk-fill chaos experiment to use the helper pod approach for injection instead of running a dummy pod with a sleep command into which multiple exec operations occur.

  • Additional unit tests in the chaos-operator & chaos-runner repos.

  • Improves e2e tests (PRs/Commits) (pod chaos with combinations of pods_affected_perc & sequence env, annotation on multiple workloads etc.,) in the litmus-go repo

  • Updates the litmus-sdk based on recent changes to experiment templates

Major Bug Fixes

  • Ensures that different helper pods within an experiment instance are labeled with unique values (for fixed keys) so that they can be queried for status. Without this, these helper pods were being filtered by common labels, resulting in incorrect validation. This was most evident when multiple instances of the same experiment were executed in parallel.

  • Reflects the correct verdict of the experiment upon failure and abort, along with improved events in the Kafka & Cassandra chaos experiments.

  • Ensures smooth re-run of network chaos on a target with residual tc rule from the previous instance of chaos injection (RTNETLINK answers: File exists)

  • Fixes the console spamming log messages on chaos-exporter which were seen until the ChaosResult/Engine resources were created.
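
The residual-rule failure named in the network chaos fix above can be reproduced outside Litmus: running tc qdisc add against an interface that already carries an explicitly configured root qdisc is rejected with RTNETLINK answers: File exists. The sketch below shows one way to make injection tolerant of such leftovers, using tc qdisc replace. It is not necessarily how the Litmus chaoslib handles the case. The function names, the TC_BIN override, and the eth0/2000 values are hypothetical and used only for illustration.

```shell
# Apply a netem delay idempotently: "replace" creates the root qdisc if it is
# absent and overwrites it if a previous run left one behind.
apply_netem_delay() {
  iface="$1"
  delay_ms="$2"
  "${TC_BIN:-tc}" qdisc replace dev "$iface" root netem delay "${delay_ms}ms"
}

# Remove the root qdisc; the error raised when none exists is ignored.
revert_netem() {
  iface="$1"
  "${TC_BIN:-tc}" qdisc del dev "$iface" root 2>/dev/null || true
}

# Example (requires NET_ADMIN in the target network namespace):
#   apply_netem_delay eth0 2000
#   revert_netem eth0
```

Setting TC_BIN=echo prints the tc invocation instead of running it, which is convenient for a dry run.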

Major Known Issues & Limitations

Issue:

Forced removal of the experiment helper pods (where applicable, notably in network chaos experiments), either manually or through Kubernetes eviction, can cause the chaos revert operation at the end of the chaos duration to fail or not run at all. The application under test (AUT) then continues to be subjected to chaos unless it is manually recovered.

Workaround:

The experiment pod logs show whether the helper operations have failed. If they have, the AUT pod(s) can be deleted so that they are rescheduled with a new network namespace (this applies only to applications deployed via a higher-level controller such as a deployment, statefulset, daemonset, etc.).

Fix:

This is being actively worked on (retry mechanism for chaos revert initiated in case of failed/missing helper pods) and should be available in a subsequent release.

Issue:

The pod-cpu-hog & pod-memory-hog experiments that run in the pod-exec mode (typically used when users don’t want to mount the runtime’s socket files on their pods) with the default lib can fail, even though chaos is injected successfully. This happens when the target’s image lacks certain default utilities that are used to detect the chaos processes and kill them (i.e., revert chaos) at the end of the chaos duration.

Workaround:

Users can work out the commands needed to identify and kill the chaos processes and pass them to the experiment via the env variable CHAOS_KILL_COMMAND.
Alternatively, they can make use of the pumba chaoslib, which uses external containers with the SYS_ADMIN docker capability to inject/revert the chaos while mounting the runtime socket file. Note that this is supported only on docker at this point.
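
As an illustration of the first workaround, the sketch below locates processes by name using only shell builtins and /proc, so it does not depend on ps, pgrep, or similar utilities being present in the target image. It assumes a Linux /proc filesystem and that the stress process is called md5sum (the pod-cpu-hog process named in the 1.12.0 notes). The function name find_pids_by_name is hypothetical, and the actual command to pass via CHAOS_KILL_COMMAND depends on the target image.

```shell
# Print the PIDs of all processes whose executable name equals $1.
# Only shell builtins are used; process names come from /proc/<pid>/comm.
find_pids_by_name() {
  target="$1"
  for dir in /proc/[0-9]*; do
    name=""
    # Skip entries that vanished or cannot be read.
    { read -r name < "$dir/comm"; } 2>/dev/null || continue
    if [ "$name" = "$target" ]; then
      echo "${dir#/proc/}"
    fi
  done
}

# Example: terminate every md5sum process found in the container.
#   for pid in $(find_pids_by_name md5sum); do kill "$pid"; done
```

Note that /proc/<pid>/comm truncates names to 15 characters, so longer process names would need a prefix match instead.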

Fix:

This is being actively worked on (native litmus chaoslib that can inject stress processes w/o exec requirement for docker/containerd/crio) and should be available in a subsequent release.

Installation

kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.13.0.yaml

Verify your installation

  • Verify if the chaos operator is running
    kubectl get pods -n litmus

  • Verify if chaos CRDs are installed
    kubectl get crds | grep chaos
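
The CRD check can also be scripted. This is a minimal sketch that assumes the operator manifest registers the chaosengines, chaosexperiments, and chaosresults CRDs under the litmuschaos.io group. The function name check_litmus_crds and the KUBECTL override are hypothetical.

```shell
# Report the first missing chaos CRD, or confirm that all three are present.
check_litmus_crds() {
  kubectl_bin="${KUBECTL:-kubectl}"
  for crd in chaosengines chaosexperiments chaosresults; do
    if ! "$kubectl_bin" get crd "${crd}.litmuschaos.io" >/dev/null 2>&1; then
      echo "missing CRD: ${crd}.litmuschaos.io"
      return 1
    fi
  done
  echo "all chaos CRDs present"
}

# Example:
#   check_litmus_crds
```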

For more details refer to the documentation at Docs

1.12.0

15 Jan 19:05
390d134

New Features & Enhancements

  • Moves the Litmus Portal to beta-1 phase with the following improvements:

    • Supports edit of (cron) schedule in chaos workflows
    • Ability to suspend/disable schedules
    • Improved chaos workflow diagrams with appropriate log representation for different stages/steps
    • Increased (K8s) validation in the workflow construction wizard
    • Adds the infra changes necessary to support private repositories for MyHub (UI support to come in 1.13.0)
  • Introduces a revamped chaos-exporter that removes the current dependency on the heptio event-router for the experiment execution state, which was being used to build chaos-interleaved application dashboards. The chaos exporter now pushes an increased set of metrics on chaos start/end times, status, success percentage per run, experiment specific cumulative pass/fail counts, etc., and has options to operate in both cluster-wide as well as the namespaced modes.

  • Enhances the httpProbe with options to skip certificate checks via the insecureSkipVerify flag in the ChaosEngine schema

  • Enhances the pod-autoscaler experiment with the ability to scale multiple applications (type: deployments, statefulsets) based on an APP_AFFECTED_PERC environment variable, with the apps being filtered via label selectors. Also adds support for OnChaos probes for the experiment.

  • Supports random selection of EC2 instances/Kubernetes nodes for the ec2-terminate experiment in cases where the target instance is not explicitly specified.

  • Improves error handling logic in the node-drain experiment and also adds a timeout (equal to the chaos duration period) flag to the drain operation to prevent indefinite execution (ex: to honor pod disruption budgets, stuck evictions)

  • Extends the ImagePullPolicy configuration to external probe pods (in cases where the cmdProbe is configured to run on “source” images other than the litmus go-runner).

  • Homogenizes the experiment pod logs for target pod information prior to chaos injection

  • Promotes the non-root go-runner from tech-preview to a release image. Accompanied by changes to experiments where applicable (commands, paths & file permissions)

  • Introduces a tech-preview of enhanced chaos rollback/revert logic (used initially for network chaos experiments executed in “serial” sequence ) to achieve guaranteed chaos rollback/revert under failure conditions (helper pod eviction, unexpected chaos process termination, deletion/removal, etc.,) (litmuschaos/go-runner:1.12.0-revert)

  • Enhances the ChaosResult schema to hold cumulative success/failure count information of the different run instances for a given experiment.

  • Introduces a new scaffolded chaoslib template in the litmus SDK that allows injection and revert of chaos via the CHAOS_INJECT_COMMAND & CHAOS_KILL_COMMAND environment variables, thereby giving users flexibility in creating preview experiments.

  • Releases the v0.3.1 of the chaos-ci-lib with fixes and enhancements to the chaos BDD library, and updates the e2e suites to use it.

  • Migrates from TravisCI to GitHub Actions (with parallel workflows for lint, security scan, e2e & build/push operations), where applicable, in light of reduced support for OSS projects on the former.

  • Enhances the litmus-e2e suite with new tests for verification of annotation-enabled & disabled chaos execution, ec2-terminate experiment & pumba-based chaoslib functionality. Adds the feature coverage tracker with an initial set of testcases for litmus-portal e2e pipelines

  • Enhances the litmus-helm chart testing workflows as per the latest K8s/Helm standards

  • Improves the node-restart experiment documentation & adds node-poweroff experiment documentation, with steps to obtain the ssh-keys & set up the secrets for execution.

  • Simplifies the experiment pages UX on the ChaosHub with explanation/steps to use the chaos artifacts

Major Bug Fixes

  • Fixes spurious events received on ChaosEngines installed with engineState set to stop (for deferred execution purposes). Also ensures that the ChaosInitialization is recorded once finalizers have been applied on the CR

  • Prevents a false positive with probe execution (in cases where probes were defined without the RunProperties specification) by mandating the latter using CRD validation.

  • Fixes failed/timed-out helper pod checks in the node-restart and node-poweroff experiments with an enhanced status check logic that looks for variadic/desired pod states (such as Succeeded, Running, etc.) instead of just “Running”

  • Fixes the failure to kill target docker containers using the “litmus” LIB due to the missing “host” flag pointing to the correct daemon socket path

  • Fixes a regression on the pod-cpu-hog experiment that caused only a single md5sum process to be launched on the target pods irrespective of the CPU_CORES (number of cores) input to the experiment.

  • Fixes a regression (panic) on the chaos-runner caused upon secret volumes definition in the ChaosExperiment/ChaosEngine

  • Synchronizes event messages (from the experiment pod as well as chaos-runner pod sources) with the latest experiment status/verdict in case of repeated execution (caused by frequent abort/restart operations) instead of holding stale info.

  • Replaces hardcoded socket paths in experiment helper configurations with values derived from the SOCKET_PATH environment variable

  • Fixes failed application status checks on infra-chaos experiments where the .spec.appinfo.applabel is not specified/skipped. In this case, the health of all pods in the chaos namespace is verified.

  • Fixes the documentation with the correct kubectl command to patch the ChaosEngine for abort/restart.
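
For reference, the abort/restart patch mentioned in the last fix can be wrapped in a small helper. This is a minimal sketch that assumes the engineState values stop (abort) and active (restart). The function name set_engine_state, the KUBECTL override, and the engine name nginx-chaos are hypothetical and used only for illustration.

```shell
# Patch spec.engineState on a ChaosEngine: "stop" aborts the experiment,
# "active" (re)starts it.
set_engine_state() {
  engine="$1"
  namespace="$2"
  state="$3"
  "${KUBECTL:-kubectl}" patch chaosengine "$engine" -n "$namespace" \
    --type merge --patch "{\"spec\":{\"engineState\":\"$state\"}}"
}

# Example:
#   set_engine_state nginx-chaos default stop     # abort
#   set_engine_state nginx-chaos default active   # restart
```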

Major Known Issues & Limitations

Issue:

Forced removal of the experiment helper pods (where applicable, notably in network chaos experiments), either manually or through Kubernetes eviction, can cause the chaos revert operation at the end of the chaos duration to fail or not run at all. The application under test (AUT) then continues to be subjected to chaos unless it is manually recovered.

Workaround:

The experiment pod logs show whether the helper operations have failed. If they have, the AUT pod(s) can be deleted so that they are rescheduled with a new network namespace (this applies only to applications deployed via a higher-level controller such as a deployment, statefulset, daemonset, etc.).

Fix:

This is being actively worked on (retry mechanism for chaos revert initiated in case of failed/missing helper pods) and should be available in a subsequent release.

Issue:

The pod-cpu-hog & pod-memory-hog experiments that run in the pod-exec mode (typically used when users don’t want to mount the runtime’s socket files on their pods) with the default lib can fail, even though chaos is injected successfully. This happens when the target’s image lacks certain default utilities that are used to detect the chaos processes and kill them (i.e., revert chaos) at the end of the chaos duration.

Workaround:

Users can work out the commands needed to identify and kill the chaos processes and pass them to the experiment via the env variable CHAOS_KILL_COMMAND.
Alternatively, they can make use of the pumba chaoslib, which uses external containers with the SYS_ADMIN docker capability to inject/revert the chaos while mounting the runtime socket file. Note that this is supported only on docker at this point.

Fix:

This is being actively worked on (native litmus chaoslib that can inject stress processes w/o exec requirement for docker/containerd/crio) and should be available in a subsequent release.

Installation

kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.12.0.yaml

Verify your installation

  • Verify if the chaos operator is running
    kubectl get pods -n litmus

  • Verify if chaos CRDs are installed
    kubectl get crds | grep chaos

For more details refer to the documentation at Docs

1.11.0

15 Dec 19:52
d2fdadb

New Features & Enhancements

  • Moves the Litmus Portal to beta-0 phase with first-cut API documentation, view-only users, install/operation support in air-gapped environments, non-root/non-privileged containers etc.,

  • Introduces the Prometheus Probe to facilitate metrics based SLO validation during experiment runs

  • Enhances the litmus probes by adding regex support for output comparison, OpenAPI v3 based CRD validation for probe schema, error handling & probe logging improvements

  • Adds the node-restart & node power-off experiments for Kubevirt based Linux VMs

  • Support for adding ENV variable values from ConfigMaps and Secrets in the ChaosEngine. This is especially useful in the case of platform-specific (Ex: AWS) chaos experiments.

  • Allows chaos annotations for more than one application workload that shares the same labels (controls chaos for a set of apps)

  • Supports the definition of resource requests/limits for chaos-runner & helper pods

  • Extends the native litmus chaoslib for network chaos on docker runtime while continuing to support pumba lib. This is expected to help users that do not want additional images (defined by the TC_IMAGE env in the network chaos experiments) pulled during the course of the experiment.

  • Cleans up the failed/orphaned helper pods based on the jobCleanupPolicy specified in ChaosEngine.

  • Propagates the ImagePullPolicy of the experiment resource to the helper pods

  • Refactors the chaos-runner to avoid unnecessary experiment dependency checks (for configmaps, secrets) where applicable and alters the flow to fail faster in case of issues such as missing experiment CRs, etc.,

  • Removes dependency on (availability of) crictl.yaml on the Kubernetes nodes for the execution of experiments on containerd/crio runtime (esp useful for K3s, MicroK8S platforms)

  • Reduces the image sizes for the chaos-operator & chaos-runner pods while significantly reducing vulnerabilities with a new base image

  • Adds non-root experiment (go-runner) images in the tech-preview stage for beta testing.

  • Introduces a recommended PodSecurityPolicy configuration for LitmusChaos experiments for use in restricted environments

  • Improves the experiment bootstrap experience with a simple scaffold CLI/SDK

  • Simplifies the ChaosEngine sample specs on the ChaosHub by removing redundant attributes, renaming the ENVs referring to remote services/hosts in the network chaos experiments, synchronizing runtime & socket-path vars, etc.,

  • Adds integration tests as a PR check (triggered on each commit unless skipped via tag) on a containerd based cluster (KIND) for the chaos-operator, chaos-runner, litmus-go & litmus-helm repos

  • Improves the litmus-e2e with dedicated pipelines on AWS cloud for pod level, infra (node) level experiment tests & control plane functionality tests with schedules setup for nightly builds on the ci tag. This aids in faster and easier on-demand execution.

  • Adds a first-cut visualization of the e2e metrics based on a coverage tracker

  • Includes a helm chart (with an entry/release item on the helmfile) for the litmus-portal

  • Provides an option to execute the Litmus Demo from a container and adds EKS as a test platform.

Major Bug Fixes

  • Fixes issues in the chaos-runner & experiment logic which led to failed event generation when the experiment is restarted post an abort operation

  • Adds the pods/exec resource to the experiment RBAC to support the source mode of operation of cmdProbe wherein the probe command is executed from within a dedicated pod whose source image has been specified. Without this change, probe execution is unsuccessful.

  • Fixes the behavior where the application pods configured with liveness probes enter CrashLoopBackOff state post network chaos injection in case of containerd runtime. This was caused due to an unsuccessful revert of chaos due to the change in container PID which was used by litmus to inject the netem rules. The fix involves injecting the rules on the corresponding sandbox container instead of the app container itself thereby facilitating successful chaos revert. With this, the app pods are expected to recover w/o manual intervention depending upon the existing backOff delay/no of restarts during the desired chaos duration.

  • Fixes the developer flow with Okteto based dev container/environments: executing the experiment code from within the litmus-experiment test deployment was seen to fail due to failed probe initialization (whereas the chaosengine is not defined at this stage at all). This has been fixed to ensure the probe initialization occurs only if the experiment is triggered by the chaosengine & probes are defined.

  • Removes “auxiliaryAppInfo” as an attribute in non-infra experiments (w/o cluster-wide rolebinding). Providing this attribute in pod-level experiments caused failed entry/exit application status checks due to lack of permissions.

  • Cleans up the permissions on the chaos operator cluster role to avoid listing of unrelated resources under API groups

  • Fixes version comparison on the ChaosHub server to reflect the latest chaos-charts release on the website

  • Fixes the chaos-exporter deployment crash upon startup with appropriate entrypoint script

  • Propagates the docker socket file path to the pumba helper pod for network chaos experiments instead of the hardcoded /var/run/docker.sock

Major Known Issues & Limitations

Issue

Forced removal of the experiment helper pods (where applicable, notably in network chaos experiments), either manually or through Kubernetes eviction, can cause the chaos revert operation at the end of the chaos duration to fail or not run at all. The application under test (AUT) then continues to be subjected to chaos unless it is manually recovered.

  • Workaround

    The experiment pod logs show whether the helper operations have failed. If they have, the AUT pod(s) can be deleted so that they are rescheduled with a new network namespace (this applies only to applications deployed via a higher-level controller such as a deployment, statefulset, daemonset, etc.).

  • Fix

    This is being actively worked on (retry mechanism for chaos revert initiated in case of failed/missing helper pods) and should be available in a subsequent release.

Issue

The pod-cpu-hog & pod-memory-hog experiments that run in the pod-exec mode (typically used when users don’t want to mount the runtime’s socket files on their pods) with the default lib can fail, even though chaos is injected successfully. This happens when the target’s image lacks certain default utilities that are used to detect the chaos processes and kill them (i.e., revert chaos) at the end of the chaos duration.

  • Workaround

    • Users can work out the commands needed to identify and kill the chaos processes and pass them to the experiment via the env variable CHAOS_KILL_COMMAND.
    • Alternatively, they can make use of the pumba chaoslib, which uses external containers with the SYS_ADMIN docker capability to inject/revert the chaos while mounting the runtime socket file. Note that this is supported only on docker at this point.
  • Fix

    • This is being actively worked on (native litmus chaoslib that can inject stress processes w/o exec requirement for docker/containerd/crio) and should be available in a subsequent release.

Installation

kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.11.0.yaml

Verify your installation

  • Verify if the chaos operator is running
    kubectl get pods -n litmus

  • Verify if chaos CRDs are installed
    kubectl get crds | grep chaos

For more details refer to the documentation at Docs

1.10.0

15 Nov 20:44
4cd5ed9

New Features & Enhancements

  • Introduces the alpha-2 version of Litmus Portal with:

    • Ability to configure custom chaos charts (experiment custom resources) source, a.k.a., “MyHub” to a project
    • Support for full CRUD operations on chaos (argo) workflows
    • Support for graceful removal of connected cluster targets
    • Optimizes the workflow for self-cluster connect, i.e., ability to add the cluster hosting the portal itself as a target.
    • Enhanced event handling for chaos workflows
    • Improves resiliency of the portal front-end
  • Adds support for resource filtering and chaos injection on pods managed by Argo Rollout resources, facilitating validation of blue-green & canary deployments

  • Promotes multiarch (amd64, arm64) docker images for all major litmus infra components: chaos-operator, chaos-runner, go-runner, chaos-exporter

  • Introduces a newer probe mode “OnChaos” for verification of steady-state only during the chaos injection period. This is specifically useful for “negative-test” scenarios where the results of steady-state checks are dependent on/tied to the unavailability of certain services.

  • Extends the scope of the cmdProbe by supporting complex criteria against different output types: integer/float (equal to, less than/less than equal to, greater than/greater than equal to) and strings (substring, string match)

  • Paves way for increased application filtering and resource-specific status checks via propagation of application kind to the experiment job.

  • Supports definition of taint tolerations in the chaos-runner & experiment pods via ChaosEngine to enable scheduling of chaos resources on nodes specifically tainted for this purpose.

  • Supports the specification of NodeSelector in chaos-runner pods via ChaosEngine for guaranteed-schedule on dedicated nodes.

  • Includes experiments to induce chaos on platform resources (AWS) as part of the kube-aws experiment suite:

    • Terminates EC2 instances (cluster nodes) using a native litmus chaoslib that leverages the AWS Go SDK
    • Induces disk loss via detachment of EBS volumes/disks attached to the specified instance

  • Introduces an SSH-based node restart experiment to the generic experiment suite (tech preview)

  • Lists use-cases for testing resiliency of Kubernetes system and add-on components (kube-proxy, kiam, calico, etc.,) based on pod-delete chaos under the kube-components suite

  • Provides an option to specify blast-radius (NODES_AFFECTED_PERCENTAGE) for node-level resource chaos experiments

  • Allows specification of a comma-separated list of target pods or nodes in cases where a known set of objects need to be targeted.

  • Adds specification of an optional VOLUME_MOUNT_PATH env variable to the pod-level IO stress experiment, thereby allowing capacity/stress chaos against both ephemeral and persistent storage volumes.

  • Enhances the pod-autoscaler experiment to:

    • Act on statefulsets, apart from deployments
    • Roll back immediately to the initial replica count when the experiment is aborted
    • Use chaos-duration as the upper limit for pod scale
  • Enhances the default pre-chaos criteria on the respective infra-level experiments to check infra components health (nodes, disk) apart from just the applications under test / auxiliary applications

  • Homogenizes the environment variable naming patterns across experiments for pod and node details and improves probe logs to be more descriptive of the status and errors.

  • Adds more validation capability to the admission controller (presence of application namespace) along with increasing unit-test coverage

  • Improves the experiment e2e suite with tests for all the newly included enhancements and adds validation (chaos-execution checks) for network & resource chaos experiments

  • Provides a new helm chart for Litmus Portal with the ability to control mode of portal operation (namespaced v/s cluster scope) amongst other tunables

  • Enhances the litmus documentation with steps for helm based install, references to learning resources (tutorials, arch slides), docs for the newly added experiments & improved contributing guide.

  • Dockerizes the litmus-demo script to ease demo steps

  • The period of this release also saw the SIG-Orchestration being operationalized. Refer to the SIG meeting notes for details.
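
The blast-radius option (NODES_AFFECTED_PERCENTAGE) listed above turns a percentage into a number of targets. The sketch below shows one plausible way to do that arithmetic. The rounding down, the cap at 100 percent, and the minimum of one target are assumptions of this sketch rather than confirmed Litmus behavior, and the function name targets_for_percentage is hypothetical.

```shell
# Number of targets to pick from a pool, given a percentage.
# Rounds down, caps the percentage at 100 and always selects at least one.
targets_for_percentage() {
  total="$1"
  perc="$2"
  if [ "$perc" -gt 100 ]; then
    perc=100
  fi
  count=$(( total * perc / 100 ))   # integer division rounds down
  if [ "$count" -lt 1 ]; then
    count=1
  fi
  echo "$count"
}

# Example:
#   targets_for_percentage 10 50   # 5 of 10 nodes
#   targets_for_percentage 3 10    # rounds down to 0, raised to 1
```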

Major Bug Fixes

  • Prevents attempts to generate call-home metrics when the ANALYTICS environment variable is set to false on the chaos operator deployment. Multiple failed attempts to send the g.analytics events in air-gapped environments were seen to result in additional time taken to launch the experiment jobs (nearly 10-12s)

  • Reduces the time taken between successive events on the chaos-runner and also fixes the behavior of missed events

  • Optimizes the time taken to gauge successful experiment pod schedule and completion via reduced polling intervals

  • Fixes the behavior where the chaos events are overridden when more than one experiment is listed in the ChaosEngine

  • Fixes issues with the CI scripts in the chaos-charts repo that led to repetition/duplication of experiments in the suite/category-wise concatenated experiments.yaml

  • Fixes incorrect schema in probe examples in the documentation

Major Known Issues & Limitations

Issue:

The pod-cpu-hog & pod-memory-hog experiments that run in the pod-exec mode (typically used when users don’t want to mount the runtime’s socket files on their pods) with the default lib can fail, even though chaos is injected successfully. This happens when the target’s image lacks certain default utilities that are used to detect the chaos processes and kill them (i.e., revert chaos) at the end of the chaos duration.

Workaround:

Users can work out the commands needed to identify and kill the chaos processes and pass them to the experiment via the env variable CHAOS_KILL_COMMAND. Alternatively, they can make use of the pumba chaoslib, which uses external containers with the SYS_ADMIN docker capability to inject/revert the chaos while mounting the runtime socket file. Note that this is supported only on docker at this point.

Issue:

Experiments requiring mount of the runtime socket file may fail on MicroK8s or K3s environments with error Falied to load config file: read /etc/crictl.yaml: is a directory.

Workaround/Fix:

This is being investigated

Issue:

The pod-cpu-hog experiment using the pumba chaoslib can, intermittently and on some platforms such as EKS, end ungracefully (after successfully injecting chaos for the specified duration) with this error: \x02\x00\x00\x00\x00\x00\x00\x1ecgroup change of group failed. In this case, the experiment verdict can show up as Fail because the chaoslib pod enters a failed state, despite the chaos having been injected.

Workaround/Fix:

This is being investigated

Installation

kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.10.0.yaml

Verify your installation

  • Verify if the chaos operator is running
    kubectl get pods -n litmus

  • Verify if chaos CRDs are installed
    kubectl get crds | grep chaos

For more details refer to the documentation at Docs

1.10.0-RC2

15 Nov 19:53
4cd5ed9
Pre-release
Fixing circleci config (#2355) (#2356)

Signed-off-by: Raj Babu Das

1.10.0-RC1

14 Nov 06:44
5f3cf27
Pre-release
Merge pull request #2349 from rajdas98/cherry-pick-1.10.x-v1

Cherry picking from master to 1.10.x

1.9.0

15 Oct 11:24
de14444

New Features & Enhancements

  • Introduces the alpha-1 version of the Litmus Portal. Adds support for scheduled workflows, chaos workflows on external agents, namespaced mode of operation, workflow analytics comparison. Also includes additional pre-defined workflows, and enhanced UX around user management.

  • Enhances the K8s probe to support full CRUD operations against native/custom resources. This is especially useful during chaos on “control-plane” components where provisioning/de-provisioning abilities can be tested. Also adds more filters to the K8s probe (labelSelectors)

  • Supports ordered execution of probes with the ability to reuse probe (result) artifacts in “downstream” probes, thereby enabling the creation of complex exit checks in standard experiments. The probe artifacts are referenced via standard templates in the ChaosEngine schema.

  • Supports configmaps & secrets definition for the chaos-runner pod. One emerging use case that makes use of this feature is to achieve cross-cluster chaos, wherein the chaos-runner executes the experiment on a different cluster to the one where the chaos operator/runner (litmus control plane) resides.

  • Allows resource request/limits specification for chaos resources (chaos-runner, experiment pods) in the ChaosEngines. Aids operations in multi-tenant environments where the experiments are being executed simultaneously across several namespaces, leading to a large set of chaos pods.

  • Adds support for ImagePullSecrets for chaos resources in the ChaosEngine to enable operations in cases where private image registries are used.

  • Provides golang chaoslib for Kafka chaos with enhancements to dynamically retrieve "current" partition leaders for each iteration of the broker kill.

  • Supports network chaos between desired microservices (specified via service IP or hostname filters) on containerd & CRIO runtime

  • Introduces different modes of chaos execution - serial and parallel defined via a SEQUENCE env var for cases where the experiment blast radius is higher. This allows chaos to be executed sequentially or in parallel on the replicas of the application under test (AUT)

  • Supports abort operation for all node & pod-level chaos experiments (except kubelet/docker service kill), including those running chaos processes in the target container’s network/process namespace. Also handles probe status for abort scenarios.

  • Minimizes the permissions/scope of the clusterroles used in the chaos operator and admin-mode serviceaccount to better comply with standard security constraints.

  • Optimizes the code structure in the litmus-go repo to ensure a single experiment binary is built (which takes individual experiment names as args) instead of building binaries for each experiment, resulting in an experiment image with a much-reduced size footprint.

  • Releases a set of multi-arch (arm64, amd64) images with tag multiarch-1.9.0 for technical preview & feedback (built via docker buildx). Will be eventually assimilated into standard release images.

  • Improves build process via docker security checks, linting & formatting checks in missing components/repos.

  • Adds the recommended Kubernetes labels for all chaos resources to enable group-management by external tools.

  • Propagates the labels & identifiers of the chaos experiment pod (defined in the ChaosExperiment CRs) to the ChaosResults to allow segregation/management.

  • Improves error handling & logging (structured logs with logrus) in the chaos-runner & experiments.

  • Improves the scaffolding tool to bootstrap experiment artifacts with the latest schema enhancements (probe support, abort support, etc.,)

  • Improves the (validation webhook) admission-controller to verify availability of configmap & secret resources specified for a chaos experiment.

  • Introduces a helmfile for Litmus to package the infra (operator, CRDs) & the experiment helm charts as part of a single (litmus stack) installation.

  • Introduces on-demand e2e test (triggered via /run-e2e commands) for Pull Requests on litmus-go repository via github actions using KIND clusters

  • Improves the e2e coverage for chaos experiments (pod-io-stress, node-io-stress, pod-autoscaler, abort support, target specification) via new tests in the pipeline based on the new additions/enhancements. The existing tests are improved with increased validation to test the success of the chaos injection procedures.

  • Adds a new GitLab pipeline with an initial set of e2e tests for Litmus Portal functions

  • Enhances the litmus-demo scripts to set up the EKS environment & execute the generic chaos suite (KIND & GKE are the other supported platforms)

  • Introduces documentation standards (and consequent update/refactor) around naming conventions for resource names, attribute names, and contribution guidelines as part of the SIG-Documentation deliberations.

  • Adds new content to litmus-docs - chaos monitoring, chaos CR schema explanations, probe enhancements, troubleshooting faq additions, etc.,

Major Bug Fixes

  • Fixes the bug wherein applications configured with liveness probes are stuck in CrashLoopBackOff state upon being subjected to network chaos (docker runtime) with revert chaos being unsuccessful. The network chaoslib now uses the container ID of the Kubernetes pause container associated with the target pod to inject the tc rules in the network namespace instead of the target app containers themselves (as they are prone to restart via liveness probes).

  • Fixes the "Failed to connect to bus: No data available" error on the kubelet-service-kill chaoslib pod

  • Fixes the regex patterns used in the CRD validation schema to support non-specification of .spec.appinfo in the ChaosEngine (either in case of node-level/infra experiments or for broader, randomized selection of pods in the pod-level experiments)

  • Adds logic to exclude the chaos-resource pods (operator, runner, experiment & helper pods) from the target list in cases where the .spec.appinfo is not specified.

  • Fixes the behavior where the chaos-runner runs forever without terminating the experiment, in cases where the experiment job is not successfully started (ImagePullBackOff, Pending etc.,). The chaos-runner is now configured to use StatusCheckTimeout defined in the ChaosEngine (defaults to 180s) to terminate the experiment.

  • Fixes the inability to inject network-chaos when the ChaosExperiment CR is created with a different name (other than the default names on the chaoshub). The logic to select the netem params based on the fixed experiment names has been altered with dedicated functions for each variant of network chaos (latency, loss, duplication, corruption).

  • Fixes improper entrypoint/command to the containerd/crio container-kill & node-io-stress chaoslib (helper) pods

  • Fixes inability to revert (downscale replicas) the pod-autoscaler chaos in cases where the application namespace and chaos namespace are different (as with admin mode execution).
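
The network-chaos fix above replaces name-based selection of netem parameters with a dedicated function per variant. A rough Go sketch of that idea follows. The function names and the eth0 interface are hypothetical and do not come from the litmus-go code; only the netem keywords (delay, loss, duplicate, corrupt) are standard tc options.

```go
package main

import "fmt"

// One helper per network-chaos variant. Each returns the netem portion of a
// tc command. All names here are hypothetical, for illustration only.
func latencyArgs(ms int) string      { return fmt.Sprintf("delay %dms", ms) }
func lossArgs(pct int) string        { return fmt.Sprintf("loss %d%%", pct) }
func duplicationArgs(pct int) string { return fmt.Sprintf("duplicate %d%%", pct) }
func corruptionArgs(pct int) string  { return fmt.Sprintf("corrupt %d%%", pct) }

// tcCommand assembles the full command for a given interface.
func tcCommand(iface, netemArgs string) string {
	return fmt.Sprintf("tc qdisc add dev %s root netem %s", iface, netemArgs)
}

func main() {
	// The variant is chosen by calling the matching function, not by
	// inspecting the ChaosExperiment name.
	fmt.Println(tcCommand("eth0", latencyArgs(2000)))
	fmt.Println(tcCommand("eth0", lossArgs(100)))
}
```

Because the caller picks the function directly, renaming the ChaosExperiment CR has no effect on which parameters are generated.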

Major Known Issues & Limitations

Issue:

  • The pod-cpu-hog & pod-memory-hog experiments that run in the pod-exec mode (which is typically used when the users don’t want to mount the runtime’s socket files on their pods) using the default lib can tend to fail, in spite of chaos being injected successfully, due to the unavailability in the target’s image of certain default utils that are used for detecting the chaos processes and killing them/reverting chaos at the end of the chaos duration.

Workaround:

  • Users can determine the commands needed to identify and kill the chaos processes and pass them to the experiment via the env variable CHAOS_KILL_COMMAND. Alternatively, they can make use of the pumba chaoslib that uses external containers with SYS_ADMIN docker capability to inject/revert the chaos, while mounting the runtime socket file. Note that this is supported only on docker at this point.

Note: Expected to be fixed in a subsequent patch/minor release

Issue:

  • The pod-cpu-hog experiment using the pumba chaoslib can end ungracefully (after successfully injecting chaos for the specified duration) with this error: \x02\x00\x00\x00\x00\x00\x00\x1ecgroup change of group failed, randomly, on some platforms like EKS. In this case, the experiment verdict can tend to show up as Fail due to the chaoslib pod entering a failed state, despite the chaos being injected.

Workaround:

  • This is being investigated

Installation

kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.9.0.yaml

Verify your installation

  • Verify if the chaos operator is running
    kubectl get pods -n litmus

  • Verify if chaos CRDs are installed
    kubectl get crds | grep chaos

For more details refer to the documentation at Docs

1.9.0-RC1

13 Oct 14:15
183ff31
Pre-release
adding create configmap permission in subscriber manifest and few ref…

1.8.0

15 Sep 17:18
b8b4ade

New Features & Enhancements

  • Introduces the alpha-0 version of Litmus Portal. The portal helps you to execute & visualize chaos workflows, amongst many other things. Learn more about it here

  • Extends Litmus Probes with “Continuous” mode to validate the hypothesis around application behavior throughout chaos execution, rather than only at specific points/phases (start & end of chaos)

  • Adds Node & Pod level I/O stress chaos experiments with the ability to tune worker threads and filesystem usage, to the generic experiment suite.

  • Supports network chaos on Containerd & CRI-O runtimes, in addition to Docker.

  • Supports network chaos between distinct microservices (in addition to total interface level egress traffic chaos) specified by their IPs or hostnames/service FQDNs

  • Enhances the ChaosSchedule schema for repeat mode by adding IncludedHours & IncludedDays. The StartTime/EndTime definitions have been made optional to allow flexibility in being able to run from the point of creation of schedule CR or indefinitely until removal.

  • Migrates Cassandra ring disruption experiment to go-based chaoslib

  • Adds the ability to specify a target pod (env: TARGET_POD) or node (env: APP_NODE) as the application/resource under test, apart from randomized selections based on labels.

  • Enables the definition of blast radius for an application as a percentage value (PODS_AFFECTED_PERCENTAGE), by which an appropriate number of replicas undergo the specified chaos in parallel.

  • Improves the litmus chaoslib to take container fs & runtime socket file paths as tunables to support different Kubernetes platforms

  • Includes an additional pumba-based chaoslib for cpu/memory stress that uses external chaos containers (non-pod exec mode)

  • Adds chaos command tunables (for chaos injection & revert) for cpu/memory chaoslib (in pod exec mode) - in order to cover different base images & distros.

  • Supports broader filtering of pods within a namespace when no application labels are provided in .spec.appInfo. Users can also choose to skip the specification of application namespace explicitly, in which case the target pods are selected randomly from the ChaosEngine resource namespace.

  • Modifies the litmus chaos containers (operator, runner) to run with non-root users

  • Allows the definition of an INSTANCE_ID in the ChaosEngine to provide additional context or metadata to an experiment run. This also aids the creation of newer ChaosResult resources instead of patching/overwriting existing ones in case of repeated executions.

  • Improves the experiment code standards by fixing the issues listed in the GoGitOps report card for the litmus-go repository.

  • Generates events against the ChaosResult resource to indicate the experiment verdict (Pass, Fail, Stopped). These are useful in annotating monitoring dashboards with experiment results.

  • Enhances the Chaos Exporter to push chaos metrics to AWS CloudWatch

  • Improves the kubernetes-chaos helm chart by including options in the values.yaml to selectively install experiments via a whitelist/blacklist. Also maps the experiment names to reflect those on the ChaosHub.

  • Enhances the litmus-e2e with increased reporting around component-tests, the addition of e2e tests for new experiments, and Docker-based Gitlab runner for litmus-portal pipelines

  • Provides additional documentation based on experiment enhancements. Updates the get started documentation for general Kubernetes/OpenShift/Rancher platforms.

  • Enhances the litmus-demo scripts to generate a pdf report for the chaos experiments executed

  • Operationalizes the Litmus community Special Interest Groups (SIGs) for Documentation, Observability & Integrations.
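
To make the PODS_AFFECTED_PERCENTAGE item above concrete, here is a minimal Go sketch of turning a percentage into a replica count. The function name targetCount, the round-down behavior, and the floor of one pod are assumptions made for illustration; the release notes do not specify the rounding rule.

```go
package main

import "fmt"

// targetCount returns how many of totalPods should undergo chaos for a given
// percentage. Hypothetical helper: it assumes integer round-down and that at
// least one pod is always targeted when any pods exist.
func targetCount(totalPods, percentage int) int {
	if totalPods <= 0 {
		return 0
	}
	// Clamp the percentage to the valid 0..100 range.
	if percentage < 0 {
		percentage = 0
	}
	if percentage > 100 {
		percentage = 100
	}
	n := totalPods * percentage / 100
	if n < 1 {
		n = 1 // assumed floor so the experiment always has a target
	}
	return n
}

func main() {
	fmt.Println(targetCount(10, 50)) // 5
	fmt.Println(targetCount(3, 50))  // 1
	fmt.Println(targetCount(4, 0))   // 1
}
```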

Major Bug Fixes

  • Constructs ChaosResult name using experiment names passed from the ChaosExperiment resource instead of hardcoded experiment names

  • Fixes the chaos verification (whether chaos injection has occurred) steps in the container-kill experiment & retains the helper containers in case of errors for further debugging

  • Fixes the chaos event messages to be meaningful & include probe information only when the probes are defined

  • Removes the need for privileged containers to execute disk-fill chaos experiment

  • Handles the case where cpu/memory hog chaos processes are terminated or the target containers are OOM-Killed (this typically occurs when the memory hog/injection value exceeds resource limits set against the pods/containers). The error code 137 is handled appropriately with warning logs and the experiment proceeds with verification steps instead of erroring out/failing (the OOM-Kill is an expected behavior based on inputs provided)

  • Fixes the behavior in node-memory hog experiments where the provided input (percentage of node memory) is measured against the available memory instead of the total system memory

  • Propagates the custom chaos experiment annotations provided in the ChaosExperiment to the helper pods, if any. This is especially useful in cases where annotations decide scheduling or are mapped to certain IAM role/accounts etc.,
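
The exit-code handling mentioned in the list above (code 137, that is 128 + 9, meaning the process was terminated by SIGKILL, as happens on an OOM-Kill) can be sketched in Go as follows. The function name handleStressExit and its return values are hypothetical and are not taken from the litmus-go source.

```go
package main

import "fmt"

// 137 = 128 + 9: the process was terminated by SIGKILL, e.g. by the OOM killer.
const sigkillExitCode = 128 + 9

// handleStressExit decides how to treat the exit code of a cpu/memory chaos
// process. Hypothetical helper: it returns warn=true when the code should be
// logged as a warning while the experiment proceeds to verification, and a
// non-nil error when the experiment should fail.
func handleStressExit(code int) (warn bool, err error) {
	switch code {
	case 0:
		return false, nil
	case sigkillExitCode:
		// Expected when the injected load exceeds the container limits.
		return true, nil
	default:
		return false, fmt.Errorf("chaos process exited with code %d", code)
	}
}

func main() {
	for _, code := range []int{0, 137, 1} {
		warn, err := handleStressExit(code)
		fmt.Printf("code=%d warn=%t err=%v\n", code, warn, err)
	}
}
```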

Deprecations & Breaking Changes

  • The instance count (.spec.schedule.instanceCount) property on the chaosSchedule has been deprecated in favor of maintaining just the minChaosInterval as a means of defining chaos cadence.

Major Known Issues & Limitations

Issue

  • The network chaos experiments (especially on docker runtime, using the litmus pumba lib) can end up with a Failed ChaosResult, and the app stuck in CrashLoopBackOff state in case of application deployments configured with liveness probes (that are set up to access health/service endpoints). Typically, this lib injects the tc netem rule against the interface by running a “chaos container” that attaches to the network namespace of the target container via the target’s container ID. The same ID is used in a subsequent container launched to revert the rule/chaos. However, with liveness probes, the container is restarted several times during the course of the chaos duration, causing the ID to change. The revert fails, with the network rule still persisting (courtesy of the Kubernetes pause container for this app pod), leading to the app entering a CrashLoopBackOff state.

Current Workaround

  • Delete/reschedule the target pod manually to recreate the pause container/network namespace.
  • Use Target IPs or Hosts to inject the chaos between specific microservices while keeping the probe alive.

Note: This is expected to be fixed in a 1.8.x patch release

Issue

  • The kubelet-service-kill experiment makes use of systemctl to stop/start the service today. Running this experiment without an external LIB_IMAGE & leveraging the experiment image can throw the error "Failed to connect to bus: No data available" as the experiment runs with a non-root user.

Current Workaround

  • A standard Ubuntu image that runs as root can be used in a “helper” pod that injects this chaos. However, user-discretion is advised in terms of providing this access.

Issue

  • The pod-cpu-hog & pod-memory-hog experiments that run in the pod-exec mode (which is typically used when the users don’t want to mount the runtime’s socket files on their pods) using the default lib can tend to fail, in spite of chaos being injected successfully, due to the unavailability in the target’s image of certain default utils that are used for detecting the chaos processes and killing them/reverting chaos at the end of the chaos duration.

Workaround

  • Users can identify the necessary commands to derive and kill the chaos PIDs and pass them to the experiment via env variable CHAOS_KILL_COMMAND

  • Alternatively, they can make use of the chaos lib that uses external containers with SYS_ADMIN docker capability to inject/revert the chaos, while mounting the runtime socket file. Note that this is supported only on docker at this point.

Note: This is expected to be fixed in a 1.8.x patch release

Installation

kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.8.0.yaml

Verify your installation

  • Verify if the chaos operator is running
    kubectl get pods -n litmus

  • Verify if chaos CRDs are installed
    kubectl get crds | grep chaos

For more details refer to the documentation at Docs

1.8.0-RC2

15 Sep 04:21
3c34d21
Pre-release
Merge pull request #2071 from rajdas98/cherry-pick-1.8.0-rc2

Cherry pick 1.8.0 rc2