[Draft] CSPL-3354: Add Lifecycle Hooks and Configurable Termination Grace Period to Splunk Operator #1424

Draft
wants to merge 2 commits into
base: feature/CSPL-3344

vivekr-splunk
Collaborator

Overview

This Pull Request introduces enhancements to the Splunk Operator by integrating Lifecycle Hooks and allowing customers to configure the Termination Grace Period via the Custom Resource (Common Spec). These changes aim to ensure graceful shutdowns of Splunk pods, thereby maintaining data integrity and improving the reliability of Splunk deployments on Kubernetes.

Problem Statement

Customers running Splunk on Kubernetes have reported issues related to abrupt pod terminations, especially during node recycling or maintenance operations. Without proper shutdown procedures, Splunk instances may not decommission gracefully, leading to potential data loss and increased operational churn. Additionally, the lack of configurable grace periods limits customers' ability to tailor shutdown behaviors to their specific environments and requirements.

Proposed Solution

  1. Integrate Lifecycle Hooks:

    • preStop Hook: Runs splunk offline followed by splunk stop before the container receives its termination signal. This lets Splunk instances decommission gracefully and reduces the risk of data corruption and loss.
  2. Configurable Termination Grace Period:

    • Custom Resource Update: Introduce a new field in the Common Spec of the Splunk Operator’s Custom Resource to allow customers to specify terminationGracePeriodSeconds.
    • Default Value: If not specified by the customer, a sensible default (e.g., 60 seconds) is applied to ensure sufficient time for graceful shutdowns.
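
The defaulting rule in item 2 can be sketched in Go, the operator's implementation language. This is a minimal sketch under stated assumptions, not the code in this PR: CommonSpec is a trimmed-down stand-in type, effectiveGracePeriod is a hypothetical helper name, and the 60-second constant only mirrors the example default mentioned above.

```go
package main

import "fmt"

// defaultTerminationGracePeriodSeconds mirrors the example default of 60
// seconds from the description; the value the operator ships may differ.
const defaultTerminationGracePeriodSeconds int64 = 60

// CommonSpec is a hypothetical, trimmed-down stand-in for the operator's
// common spec. Only the new field is shown.
type CommonSpec struct {
	// A pointer distinguishes "not set" (nil) from an explicit value.
	TerminationGracePeriodSeconds *int64 `json:"terminationGracePeriodSeconds,omitempty"`
}

// effectiveGracePeriod returns the value to place on the pod spec:
// the customer's value when present, the default otherwise.
func effectiveGracePeriod(spec CommonSpec) int64 {
	if spec.TerminationGracePeriodSeconds == nil {
		return defaultTerminationGracePeriodSeconds
	}
	return *spec.TerminationGracePeriodSeconds
}

func main() {
	custom := int64(120)
	fmt.Println(effectiveGracePeriod(CommonSpec{}))                                       // 60
	fmt.Println(effectiveGracePeriod(CommonSpec{TerminationGracePeriodSeconds: &custom})) // 120
}
```

Using a pointer field keeps an omitted value distinguishable from an explicit one, so existing Custom Resources that do not set the field fall through to the default.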

Changes Made

  • Custom Resource Definition:

    • Added terminationGracePeriodSeconds under the commonSpec section to allow customization.
    apiVersion: enterprise.splunk.com/v4
    kind: IndexerCluster
    metadata:
      name: indexer-splunk
    spec:
      terminationGracePeriodSeconds: 120 # Customizable grace period in seconds
      # ... other common specifications
      # ... other cluster specifications
  • StatefulSet Template Update:

    • Modified the StatefulSet templates generated by the Splunk Operator to include the lifecycle section with the preStop hook.
    • Incorporated the terminationGracePeriodSeconds value from the Common Spec.
    spec:
      terminationGracePeriodSeconds: {{ .Spec.TerminationGracePeriodSeconds | default 60 }}
      containers:
        - name: splunk
          image: splunk/splunk:latest
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "splunk offline && splunk stop"]
          # ... other container configurations
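
The template above could be produced in Go roughly as follows. This is a sketch, not the operator's actual code: the struct types are simplified local stand-ins shaped like the Kubernetes core/v1 types of the same names so the example compiles without external modules, and applyGracefulShutdown is a hypothetical helper name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Simplified stand-ins for the Kubernetes core/v1 types of the same names.
type ExecAction struct {
	Command []string `json:"command"`
}

type LifecycleHandler struct {
	Exec *ExecAction `json:"exec,omitempty"`
}

type Lifecycle struct {
	PreStop *LifecycleHandler `json:"preStop,omitempty"`
}

type Container struct {
	Name      string     `json:"name"`
	Lifecycle *Lifecycle `json:"lifecycle,omitempty"`
}

type PodSpec struct {
	TerminationGracePeriodSeconds *int64      `json:"terminationGracePeriodSeconds,omitempty"`
	Containers                    []Container `json:"containers"`
}

// preStopCommand is the shell command used in the template above.
var preStopCommand = []string{"/bin/sh", "-c", "splunk offline && splunk stop"}

// applyGracefulShutdown sets the grace period on the pod and attaches the
// preStop hook to the container named "splunk". Other containers are left
// untouched.
func applyGracefulShutdown(pod *PodSpec, graceSeconds int64) {
	pod.TerminationGracePeriodSeconds = &graceSeconds
	for i := range pod.Containers {
		if pod.Containers[i].Name != "splunk" {
			continue
		}
		pod.Containers[i].Lifecycle = &Lifecycle{
			PreStop: &LifecycleHandler{Exec: &ExecAction{Command: preStopCommand}},
		}
	}
}

func main() {
	pod := PodSpec{Containers: []Container{{Name: "splunk"}}}
	applyGracefulShutdown(&pod, 120)

	out, err := json.MarshalIndent(pod, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Kubernetes counts the time spent in a preStop hook against terminationGracePeriodSeconds, so the configured value needs to cover both splunk offline and splunk stop.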

Benefits

  • Graceful Shutdowns: Ensures that Splunk pods decommission properly, maintaining data integrity and reducing the risk of corruption.
  • Customization: Empowers customers to define their own termination grace periods based on their operational needs and Splunk’s shutdown requirements.
  • Improved Reliability: Minimizes unexpected downtime and operational issues related to abrupt pod terminations.
  • Kubernetes Best Practices: Aligns Splunk deployments with Kubernetes lifecycle management best practices, enhancing overall deployment robustness.

Related Issues

  • Closes #CSPL-3354: Implement lifecycle hooks for graceful pod shutdowns. Add configurable termination grace period to Splunk Operator Custom Resource.

Testing Performed

  1. Unit Tests:

    • Verified that the terminationGracePeriodSeconds from the Custom Resource is correctly applied to the StatefulSet.
    • Ensured that the preStop lifecycle hook executes the appropriate Splunk commands.
  2. Integration Tests:

    • Deployed the updated Splunk Operator in a staging environment.
    • Simulated pod terminations and confirmed that splunk offline and splunk stop commands were executed before termination.
    • Tested with different terminationGracePeriodSeconds values to ensure flexibility and correctness.
  3. Manual Testing:

    • Conducted node recycling operations to observe the behavior of Splunk pods during graceful shutdowns.
    • Verified that no data loss or corruption occurred during pod recycling.

Documentation Updates

  • Operator README:

    • Added sections detailing the new terminationGracePeriodSeconds field in the Custom Resource.
    • Provided examples demonstrating how to configure lifecycle hooks and grace periods.
  • Configuration Guides:

    • Updated guides to include best practices for setting terminationGracePeriodSeconds based on different deployment scenarios.

How to Test

  1. Update Custom Resource:

    • Modify the terminationGracePeriodSeconds in your Splunk Operator Custom Resource.
  2. Deploy or Update Splunk Cluster:

    • Apply the updated Custom Resource to deploy or update your Splunk cluster.
  3. Verify StatefulSet Configuration:

    • Ensure that the StatefulSet includes the preStop lifecycle hook and the correct terminationGracePeriodSeconds.
  4. Simulate Pod Termination:

    • Manually delete a Splunk pod and observe the execution of the preStop hook.
    • Confirm that Splunk gracefully shuts down before the pod is terminated.
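
Step 3 can be automated with a small check. The sketch below is illustrative only: checkStatefulSet is a hypothetical helper, and sampleManifest is a hand-written fragment used for demonstration. In practice the input would be the JSON for the StatefulSet created by the operator, for example the output of kubectl get statefulset with -o json.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
)

// statefulSet decodes only the StatefulSet fields this check needs.
type statefulSet struct {
	Spec struct {
		Template struct {
			Spec struct {
				TerminationGracePeriodSeconds *int64 `json:"terminationGracePeriodSeconds"`
				Containers                    []struct {
					Name      string `json:"name"`
					Lifecycle *struct {
						PreStop *struct {
							Exec *struct {
								Command []string `json:"command"`
							} `json:"exec"`
						} `json:"preStop"`
					} `json:"lifecycle"`
				} `json:"containers"`
			} `json:"spec"`
		} `json:"template"`
	} `json:"spec"`
}

// checkStatefulSet reports an error unless the manifest carries the expected
// grace period and at least one container defines a preStop exec hook.
func checkStatefulSet(data []byte, wantGrace int64) error {
	var sts statefulSet
	if err := json.Unmarshal(data, &sts); err != nil {
		return fmt.Errorf("decode StatefulSet: %w", err)
	}
	pod := sts.Spec.Template.Spec
	if pod.TerminationGracePeriodSeconds == nil || *pod.TerminationGracePeriodSeconds != wantGrace {
		return fmt.Errorf("terminationGracePeriodSeconds is not %d", wantGrace)
	}
	for _, c := range pod.Containers {
		lc := c.Lifecycle
		if lc != nil && lc.PreStop != nil && lc.PreStop.Exec != nil && len(lc.PreStop.Exec.Command) > 0 {
			return nil
		}
	}
	return errors.New("no container defines a preStop exec hook")
}

// sampleManifest is a hand-written fragment, not output captured from a cluster.
const sampleManifest = `{
  "spec": {"template": {"spec": {
    "terminationGracePeriodSeconds": 120,
    "containers": [{
      "name": "splunk",
      "lifecycle": {"preStop": {"exec": {
        "command": ["/bin/sh", "-c", "splunk offline && splunk stop"]
      }}}
    }]
  }}}
}`

func main() {
	if err := checkStatefulSet([]byte(sampleManifest), 120); err != nil {
		fmt.Fprintln(os.Stderr, "FAIL:", err)
		os.Exit(1)
	}
	fmt.Println("OK")
}
```

The check only confirms that the hook and grace period are configured. Whether Splunk actually shuts down cleanly still has to be observed as described in step 4.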

Future Considerations

  • Enhanced Shutdown Commands: Explore the possibility of using splunk decommission if it provides more comprehensive shutdown procedures compared to splunk offline and splunk stop.
  • Dynamic Configuration: Allow for dynamic updates to the terminationGracePeriodSeconds without requiring full cluster redeployments.
  • Monitoring and Alerts: Integrate monitoring to track the execution and success of lifecycle hooks, providing alerts in case of failures.

Reviewer Notes

  • Backward Compatibility: Ensure that existing deployments without the terminationGracePeriodSeconds field continue to operate with the default grace period.
  • Security Considerations: Validate that the execution of shutdown commands does not introduce security vulnerabilities or expose sensitive information.
  • Performance Impact: Assess any potential performance implications of the added lifecycle hooks during pod terminations.

Pull Request Checklist:

  • Code changes adhere to the project's coding standards.
  • Relevant unit and integration tests are included.
  • Documentation has been updated accordingly.
  • All tests pass locally.
  • The PR description follows the project's guidelines.

vivekr-splunk changed the title from "CSPL-3354: Add Lifecycle Hooks and Configurable Termination Grace Period to Splunk Operator" to "[Draft] CSPL-3354: Add Lifecycle Hooks and Configurable Termination Grace Period to Splunk Operator" on Jan 27, 2025
vivekr-splunk self-assigned this on Jan 27, 2025
vivekr-splunk marked this pull request as draft on January 27, 2025 22:55
vivekr-splunk changed the base branch from develop to feature/CSPL-3344 on January 27, 2025 23:00
@@ -3,7 +3,7 @@ apiVersion: apiextensions.k8s.io/v1
 kind: CustomResourceDefinition
 metadata:
   annotations:
-    controller-gen.kubebuilder.io/version: v0.16.1
+    controller-gen.kubebuilder.io/version: v0.14.0
Collaborator

Is there a reason we are switching to an older version here?

Collaborator Author

Well, no. I think I have an older version of kubebuilder; I will upgrade it.
