Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Open FlyteFile from remote path #2991

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

JiangJiaWei1103
Copy link
Contributor

@JiangJiaWei1103 JiangJiaWei1103 commented Dec 8, 2024

Tracking issue

Closes flyteorg/flyte#6090.

Why are the changes needed?

Directly opening a local or remote file with FlyteFile interface is quite useful. Also, this is commonly used with a context manager like:

with open(ff, "r") as f:
    # Read your file here
    # ...

Following discuss both the local and remote cases:

  1. Directly open FlyteFile from a local path
ff = FlyteFile(path=local_path)
with open(ff, "r") as f:
    # Read your local file here
    # ...

In this case, we don't need to download the file because it's already on the local file system. Hence, we just run a dummy downloading.

  1. Directly open FlyteFile from a remote path (our focus in this PR)
ff = FlyteFile(path=remote_path)
with open(ff, "r") as f:
    # Read your remote file here
    # ...

As for the remote file, we definitely need to download the file from remote to our local file system before the file is accessed.

With this fix, users can directly open either a local or remote file in a consistent manner.

What changes were proposed in this pull request?

  1. Make downloader a static method of FlyteFilePathTransformer.
    • Its behavior is logically related to FlyteFilePathTransformer but doesn't need to interact with class or instance data.
  2. Explicitly setup remote file source and local file destination in __fspath__ magic method of FlyteFile.
    • Support remote file downloading.

How was this patch tested?

This patch is first tested with the following code snippet:

import os

from flytekit import task
from flytekit.types.file import FlyteFile


# Remote path
remote_path = "s3://my-s3-bucket/test.json"


@task
def create_ff() -> FlyteFile:
    ff = FlyteFile(path=remot_path)
    with open(ff, "r") as f:
        content = f.read()
        print(content)

    return ff


if __name__ == "__main__":
    from flytekit.clis.sdk_in_container import pyflyte
    from click.testing import CliRunner

    runner = CliRunner()
    path = os.path.realpath(__file__)
    result = runner.invoke(pyflyte.main, ["run", path, "create_ff"])
    print(result.output)

The result is shown as follows:

Screenshot 2024-12-08 at 9 58 23 PM

Setup process

git clone https://github.com/flyteorg/flytekit.git
gh pr checkout 2991
make setup && pip install -e .

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

@JiangJiaWei1103
Copy link
Contributor Author

JiangJiaWei1103 commented Dec 8, 2024

Follow-ups

  1. Write integration test
    • Still pondering how to enable s3 interaction in integration test.
  2. Experiment on FlyteDirectory to see if there's similar issue

@Future-Outlier Future-Outlier self-assigned this Dec 9, 2024
@eapolinario
Copy link
Collaborator

eapolinario commented Dec 18, 2024

Thank you for your change, this indeed looks useful! I have one suggestion:

Write integration test

  • Still pondering how to enable s3 interaction in integration test.

You should be able to add FlyteFile specific integration tests that hit the minio container in the sandbox. Take a look at https://github.com/flyteorg/flytekit/blob/master/.github/workflows/pythonbuild.yml#L264-L266, if you specify those environment variables you should be able to write to an minio s3 bucket.

@JiangJiaWei1103
Copy link
Contributor Author

Hi @eapolinario,

Amazing, thanks very much for the guide. I'll add the corresponding integration test asap!

@JiangJiaWei1103
Copy link
Contributor Author

Hi @eapolinario,

This patch is now covered by the newly added integration test. If there's anything that can be improved, please let me know. Thanks a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In review
Development

Successfully merging this pull request may close these issues.

[BUG] FlyteFile from a remote path should be read correctly in a task
3 participants