Pipeline logs are disappearing after 24h #1120

Open
AxoyTO opened this issue Oct 18, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@AxoyTO

AxoyTO commented Oct 18, 2024

Bug Description

24 hours after a pipeline run is created, all of its logs disappear from the Charmed Kubeflow UI, even though they are still present in MinIO/mlpipeline (AWS S3). This makes it difficult to troubleshoot failures or track the progress of pipeline runs once the 24-hour period has passed.

!aws --endpoint-url $MINIO_ENDPOINT_URL s3 ls s3://mlpipeline
                           [...]
                           PRE addition-pipeline-4g94d/
                           PRE addition-pipeline-4qwt4/
                           PRE download-preprocess-train-deploy-pipeline-8wjv9/
                           PRE mnist-pipeline-fcmgr/
                           [...]
!aws --endpoint-url $MINIO_ENDPOINT_URL s3 ls s3://mlpipeline/download-preprocess-train-deploy-pipeline-8wjv9/download-preprocess-train-deploy-pipeline-8wjv9-system-container-impl-1190848556/
2024-10-15 15:27:49      10796 main.log
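
Until the UI issue is resolved, the archived log can be read straight from the bucket. The following is a minimal sketch, not a confirmed workaround: it assumes the `<workflow-name>/<pod-name>/main.log` key layout seen in the listing above, that `boto3` is installed, and that credentials are configured the same way as for the `aws` CLI. The helper names `log_key` and `fetch_log` are hypothetical.

```python
def log_key(workflow_name: str, pod_name: str, filename: str = "main.log") -> str:
    """Build the object key for a step's archived log.

    Assumes the <workflow-name>/<pod-name>/main.log layout shown in the
    `aws s3 ls` output above; other deployments may use a different layout.
    """
    return f"{workflow_name}/{pod_name}/{filename}"


def fetch_log(endpoint_url: str, key: str, bucket: str = "mlpipeline") -> str:
    """Download one archived log from MinIO and return it as text."""
    # Imported lazily so log_key() stays usable without boto3 installed.
    import boto3

    # endpoint_url points the S3 client at MinIO instead of AWS.
    s3 = boto3.client("s3", endpoint_url=endpoint_url)
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return body.decode("utf-8", errors="replace")


if __name__ == "__main__":
    import os

    key = log_key(
        "download-preprocess-train-deploy-pipeline-8wjv9",
        "download-preprocess-train-deploy-pipeline-8wjv9-system-container-impl-1190848556",
    )
    # MINIO_ENDPOINT_URL is the same variable used in the commands above.
    print(fetch_log(os.environ["MINIO_ENDPOINT_URL"], key))
```

Run from the same notebook server, this prints the contents of the `main.log` object that the UI no longer displays.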

To Reproduce

  1. Deploy Charmed Kubeflow 1.9 using Juju.
  2. Create a pipeline and run it.
  3. After the run completes, observe that logs are available in the Kubeflow UI.
  4. Wait for 24 hours after the pipeline run completes.
  5. Attempt to view the pipeline logs in the UI again.
    Expected: Logs should still be accessible.
    Actual: Logs are no longer visible in the UI, but are still present in the underlying MinIO/mlpipeline (AWS S3).

Environment

CKF: 1.9/stable
minio: ckf-1.9/stable
argo-controller: 3.4/stable
Juju: 3.5.4
See the full bundle on: https://paste.ubuntu.com/p/NXXFhDqmVn/

Relevant Log Output

<none>

Additional Context

Notebook used to create the pipeline, which was run on a notebook server with a GPU:

import kfp
from kfp import dsl, kubernetes

@dsl.component(
    base_image="tensorflow/tensorflow:latest-gpu",
    # packages_to_install=["tensorflow"]
)
def foo():
    '''Prints the number of GPUs visible to TensorFlow'''
    print("GPU Test")
    import tensorflow as tf
    print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
    print("GPU Test")


@dsl.pipeline(
    name='Addition pipeline',
    description='An example pipeline that performs addition calculations.')
def foo_pipeline():
    task = (foo()
            .set_cpu_request(str(2))
            .set_cpu_limit(str(4))
            .set_memory_request("2Gi")
            .set_memory_limit("4Gi")
            .set_gpu_limit("1")
            .set_accelerator_type("nvidia.com/gpu")
           )

    task = kubernetes.set_image_pull_policy(task=task, policy="Always")

    task = kubernetes.add_toleration(
        task=task,
        key="sku",
        operator="Equal",
        value="gpu",
        effect="NoSchedule",
    )

namespace = "admin"

client = kfp.Client()

run = client.create_run_from_pipeline_func(
    run_name="gpu_test",
    pipeline_func=foo_pipeline,
    namespace=namespace,
    experiment_name="gpu-foo")

Could be related to upstream: kubeflow/pipelines#7617

AxoyTO added the bug label Oct 18, 2024

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6494.

This message was autogenerated
