Datafusion Query Planning tests in CI #9

Closed
Tracked by #18
edmondop opened this issue Oct 2, 2024 · 6 comments

@edmondop
Contributor

edmondop commented Oct 2, 2024

Quoting @andygrove on Discord:

we have rust tests for query planning, but they require some tpc-h data to be available
ideally we should be running the tpc-h queries against a small data set (sf=1) in CI and checking that the results are correct
I expect we can re-use some of the CI setup from core DataFusion

@edmondop
Contributor Author

edmondop commented Oct 2, 2024

At the moment, we already run some query planning tests, see https://github.com/edmondop/datafusion-ray/actions/runs/11133252895/job/30938921925

@andygrove
Member

> At the moment, we already run some query planning tests, see https://github.com/edmondop/datafusion-ray/actions/runs/11133252895/job/30938921925

These tests do not actually do anything currently, unless they run on my computer 😞

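async fn do_test(n: u8) -> Result<()> {
    // Hard-coded path that only exists on one developer machine.
    let data_path = "/mnt/bigdata/tpch/sf10-parquet";
    // If the data is missing (as it is in CI), return early and report success
    // without running any query.
    if !Path::new(&data_path).exists() {
        return Ok(());
    }
    // ... rest of the test is not shown in this excerpt
}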
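One way to keep such a test from passing silently is to make the data location configurable and to fail when CI expects the data to be there. The following is only a sketch under assumptions: the `TPCH_DATA_PATH` environment variable, the `DataDir` type, and the helper names are hypothetical and do not come from the datafusion-ray code base.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Outcome of looking for the TPC-H data directory (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum DataDir {
    Found(PathBuf),
    Missing(PathBuf),
}

/// Use the override path if one is given, otherwise fall back to the default.
pub fn resolve_data_dir(override_path: Option<&str>, default: &str) -> DataDir {
    let path = PathBuf::from(override_path.unwrap_or(default));
    if path.exists() {
        DataDir::Found(path)
    } else {
        DataDir::Missing(path)
    }
}

/// Read the override from the (assumed) TPCH_DATA_PATH environment variable.
pub fn data_dir_from_env() -> DataDir {
    let override_path = env::var("TPCH_DATA_PATH").ok();
    resolve_data_dir(override_path.as_deref(), "/mnt/bigdata/tpch/sf10-parquet")
}

/// Return the directory to test against, `Ok(None)` for an explicit skip,
/// or an error when the data is required (for example in CI) but missing.
pub fn check_data_dir(dir: &DataDir, require_data: bool) -> Result<Option<&Path>, String> {
    match dir {
        DataDir::Found(p) => Ok(Some(p.as_path())),
        DataDir::Missing(p) if require_data => {
            Err(format!("TPC-H data not found at {}", p.display()))
        }
        DataDir::Missing(p) => {
            // Make the skip visible in the test output instead of passing quietly.
            eprintln!("skipping: no TPC-H data at {}", p.display());
            Ok(None)
        }
    }
}
```

A test could call `check_data_dir(&data_dir_from_env(), true)` in CI and pass `false` on developer machines, so a missing data set shows up as a failure rather than a green run.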

@andygrove
Member

We have a Python script for generating TPC-H data at https://github.com/apache/datafusion-benchmarks/tree/main/tpch

We could call this in CI and then figure out how to update these tests to use that data.

@edmondop
Contributor Author

edmondop commented Oct 2, 2024

Excellent, I'll do this as a first iteration, thanks for the hint. Then we can decide how to compare the DataFusion results with the Ray results. What do you think?

@andygrove
Member

> Excellent, I'll do this as a first iteration, thanks for the hint. Then we can decide how to compare the DataFusion results with the Ray results. What do you think?

Sounds good.

We could run each query first with DF Python and then with DF Ray and compare the results. This could be a good general approach to testing.
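The comparison step could look roughly like the sketch below. It assumes both engines' results have already been collected into plain rows; the `Cell` and `Row` types, the `same_results` function, and the relative tolerance are illustrative assumptions, not an API of DataFusion or DataFusion Ray. Rows are matched regardless of order, since a distributed run may return them in a different order, and floats are compared with a tolerance because aggregation order can change rounding.

```rust
/// A single result value (hypothetical, simplified type).
#[derive(Debug, Clone, PartialEq)]
pub enum Cell {
    Int(i64),
    Float(f64),
    Text(String),
    Null,
}

pub type Row = Vec<Cell>;

/// Floats match within a relative tolerance; everything else must be equal.
fn cells_match(a: &Cell, b: &Cell, tol: f64) -> bool {
    match (a, b) {
        (Cell::Float(x), Cell::Float(y)) => {
            (x - y).abs() <= tol * x.abs().max(y.abs()).max(1.0)
        }
        _ => a == b,
    }
}

/// Two rows match when they have the same width and every cell matches.
pub fn rows_match(a: &Row, b: &Row, tol: f64) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| cells_match(x, y, tol))
}

/// Order-insensitive comparison: each expected row must pair with a distinct
/// actual row. The greedy search is quadratic, which is acceptable for the
/// small result sets TPC-H queries typically return.
pub fn same_results(expected: &[Row], actual: &[Row], tol: f64) -> bool {
    if expected.len() != actual.len() {
        return false;
    }
    let mut used = vec![false; actual.len()];
    for e in expected {
        let hit = (0..actual.len()).find(|&i| !used[i] && rows_match(e, &actual[i], tol));
        match hit {
            Some(i) => used[i] = true,
            None => return false,
        }
    }
    true
}
```

Queries with an `ORDER BY` clause would need a stricter, order-preserving check, so this sketch only covers the unordered case.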

@edmondop edmondop changed the title from "Datafusion Ray TPCH tests in CI" to "Datafusion Query Planning tests in CI" on Oct 2, 2024
@edmondop
Contributor Author

edmondop commented Oct 4, 2024

Closed with #12 #16 #17

@edmondop edmondop closed this as completed Oct 4, 2024