CoalescePartitionsExec requires at least one input partition when aggregating empty tables on single-core machines #186

Closed
mildbyte opened this issue Oct 31, 2022 · 3 comments

mildbyte commented Oct 31, 2022

This only happens on Fly.io (I think I tested this with the same Docker image locally and didn't get the issue):

$ curl -iH "Content-Type: application/json" https://seafowl/q -d '{"query": "CREATE TABLE test2 (key INTEGER, value TEXT)"}'
HTTP/2 200
$ curl -iH "Content-Type: application/json" https://seafowl/q -d '{"query": "SELECT COUNT(*) FROM test2"}'
HTTP/2 400
Internal error: CoalescePartitionsExec requires at least one input partition. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

Local Docker (TODO: quadruple-check it's definitely the same Docker image as the one on Fly):

 ~ $ curl -iH "Content-Type: application/json" http://localhost:8080/q -d '{"query": "CREATE TABLE test (key INTEGER, value TEXT)"}'
HTTP/1.1 200 OK
content-type: application/octet-stream
vary: Content-Type, Origin, X-Seafowl-Query
content-length: 0
date: Mon, 31 Oct 2022 16:48:31 GMT

 ~ $ curl -iH "Content-Type: application/json" http://localhost:8080/q -d '{"query": "SELECT COUNT(*) FROM test"}'
HTTP/1.1 200 OK
content-type: application/octet-stream
vary: Content-Type, Origin, X-Seafowl-Query
content-length: 22
date: Mon, 31 Oct 2022 16:48:40 GMT

{"COUNT(UInt8(1))":0}
mildbyte changed the title from "CoalescePartitionsExec requires at least one input partition when aggregating empty tables on Fly.io" to "CoalescePartitionsExec requires at least one input partition when aggregating empty tables" on Oct 31, 2022
@mildbyte

This doesn't only happen on Fly.io: my local Docker image had an older debug build, so this could be a regression in a recent DataFusion (DF) release.

mildbyte changed the title from "CoalescePartitionsExec requires at least one input partition when aggregating empty tables" to "CoalescePartitionsExec requires at least one input partition when aggregating empty tables on single-core machines" on Nov 2, 2022

mildbyte commented Nov 2, 2022

It looks like it's caused by the number of cores allocated, which becomes num_cpus and then feeds into DF as the target_partition_count setting.

1 core (with ./taskset -c 1 seafowl):

ProjectionExec: expr=[COUNT(UInt8(1))@0 as COUNT(UInt8(1))]      +
   AggregateExec: mode=Final, gby=[], aggr=[COUNT(UInt8(1))]      +
     CoalescePartitionsExec                                       +
       AggregateExec: mode=Partial, gby=[], aggr=[COUNT(UInt8(1))]+
         ParquetExec: limit=None, partitions=[], projection=[key] +
 

multicore:

ProjectionExec: expr=[COUNT(UInt8(1))@0 as COUNT(UInt8(1))]       +
   AggregateExec: mode=Final, gby=[], aggr=[COUNT(UInt8(1))]       +
     CoalescePartitionsExec                                        +
       AggregateExec: mode=Partial, gby=[], aggr=[COUNT(UInt8(1))] +
         RepartitionExec: partitioning=RoundRobinBatch(4)          +
           ParquetExec: limit=None, partitions=[], projection=[key]+

The physical plan for single-core is missing RepartitionExec.
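
To illustrate why the missing RepartitionExec might matter, here is a toy model in Rust. It is not DataFusion's code: `Partitions`, `coalesce` and `round_robin` are hypothetical stand-ins, and the partial aggregate between the two nodes is skipped. The assumption is that a scan with `partitions=[]` yields zero partitions, that a coalesce step rejects zero inputs (as the error message says), and that a round-robin repartition always emits its configured number of output partitions.

```rust
/// Toy stand-in for a plan node's output: one Vec of values per partition.
/// (Hypothetical type, not part of DataFusion.)
type Partitions = Vec<Vec<u64>>;

/// Toy coalesce: merges all input partitions into one and rejects
/// zero input partitions, mirroring the error text in this issue.
fn coalesce(input: &Partitions) -> Result<Vec<u64>, String> {
    if input.is_empty() {
        return Err("CoalescePartitionsExec requires at least one input partition".to_string());
    }
    Ok(input.iter().flatten().copied().collect())
}

/// Toy round-robin repartition: always yields exactly `n` output
/// partitions, even when there are no input partitions at all.
fn round_robin(input: &Partitions, n: usize) -> Partitions {
    assert!(n > 0, "need at least one output partition");
    let mut out: Partitions = vec![Vec::new(); n];
    for (i, value) in input.iter().flatten().enumerate() {
        out[i % n].push(*value);
    }
    out
}
```

In this model, `coalesce(&Vec::new())` fails (the single-core plan), while `coalesce(&round_robin(&Vec::new(), 4))` succeeds with an empty result (the multicore plan).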

mildbyte added a commit that referenced this issue Nov 2, 2022
With one partition (the default if `num_cpus=1`), we seem to hit some DataFusion
bugs:

- #186
- potentially #185

As a temporary workaround, pretend we always need at least 2 partitions, which
makes DataFusion use alternative query plans.
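
A minimal sketch of what such a workaround could look like, assuming the partition count is derived from the detected core count. This is not the actual commit: `MIN_PARTITIONS`, `effective_partitions` and `detected_cores` are hypothetical names, and `std::thread::available_parallelism` is swapped in here for the `num_cpus` value mentioned above so the example needs no external crates.

```rust
use std::num::NonZeroUsize;
use std::thread;

/// Lower bound on the partition count, per the commit message
/// ("pretend we always need at least 2 partitions").
const MIN_PARTITIONS: usize = 2;

/// Hypothetical helper: clamp a detected core count to MIN_PARTITIONS.
fn effective_partitions(detected_cores: usize) -> usize {
    detected_cores.max(MIN_PARTITIONS)
}

/// Detect the core count with the standard library (an assumption; the
/// issue mentions num_cpus), falling back to 1 if it cannot be determined.
fn detected_cores() -> usize {
    thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1)
}
```

The result of `effective_partitions(detected_cores())` would then be passed to the query engine as its target partition count, so a single-core machine still plans with 2 partitions.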

gruuya commented May 26, 2023

Seems not to be occurring anymore, even without the fix from #189 (which was removed in #422); closing.

@gruuya gruuya closed this as completed May 26, 2023