Trino returning zero records for Hudi tables in AWS Glue #24654

Open
drautela-scwx opened this issue Jan 8, 2025 · 1 comment

@drautela-scwx

We are trying to set up the following in AWS:

Trino on EKS: version 465
HMS: Glue. We have databases with both Hudi and non-Hudi external tables backed by Parquet files.

Currently we are able to query non-Hudi tables successfully.

For Hudi tables, the query executes without any error but returns zero records. We have confirmed that records should be returned by running the same query in Athena.

We have two connectors set up as follows:

(1) /etc/trino/catalog/awsdatacatalog.properties

connector.name=hive
hive.metastore=glue
hive.hive-views.enabled=true
hive.partition-projection-enabled=true
fs.native-s3.enabled=true
hive.hudi-catalog-name=hudi

(2) /etc/trino/catalog/hudi.properties

connector.name=hudi
hive.metastore=glue
fs.native-s3.enabled=true

We do have partition projection enabled for Hudi tables.

Zero records are returned for Hudi tables whether we use the awsdatacatalog catalog or the hudi catalog:

select *
from awsdatacatalog.hudi_db.test_table_ro
where partition1 = '123'
and partition2 = 'abc'
limit 10;

select *
from hudi.hudi_db.test_table_ro
where partition1 = '123'
and partition2 = 'abc'
limit 10;

Any suggestions on how to figure out why zero records are being returned? Is there any specific logging that could be turned on to help debug this issue?
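
For reference, Trino log levels can be raised per logger in etc/log.properties. A minimal sketch, assuming the plugin logger names below apply to this deployment (verify them against your Trino 465 installation):

# etc/log.properties -- raise verbosity for the connectors involved (assumed logger names)
io.trino.plugin.hudi=DEBUG
io.trino.plugin.hive=DEBUG
io.trino.plugin.hive.metastore.glue=DEBUG

This should surface how the table is resolved against the Glue metastore and whether any splits are generated for the Hudi table.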

@drautela-scwx
Author

Does the Hudi connector for Trino support partition projection?

Below is a sample DDL for the Hudi table in the Glue catalog:

CREATE EXTERNAL TABLE `test_table_ro`(
  `_hoodie_commit_time` string COMMENT '', 
  `_hoodie_commit_seqno` string COMMENT '', 
  `_hoodie_record_key` string COMMENT '', 
  `_hoodie_partition_path` string COMMENT '', 
  `_hoodie_file_name` string COMMENT '', 
  `resource_id` string COMMENT '')
PARTITIONED BY ( 
  `tenant` string COMMENT '', 
  `date` string COMMENT '')
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
WITH SERDEPROPERTIES ( 
  'hoodie.query.as.ro.table'='true', 
  'path'='s3://datalake') 
STORED AS INPUTFORMAT 
  'org.apache.hudi.hadoop.HoodieParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://datalake'
TBLPROPERTIES (
  'hudi.metadata-listing-enabled'='FALSE', 
  'last_commit_time_sync'='20240821120712520', 
  'projection.date.format'='yyyyMMdd', 
  'projection.date.range'='19700101,99990101', 
  'projection.date.type'='date', 
  'projection.enabled'='true', 
  'projection.tenant.range'='-1,8675309', 
  'projection.tenant.type'='integer', 
  'spark.sql.sources.provider'='hudi', 
  'spark.sql.sources.schema.numPartCols'='2', 
  'spark.sql.sources.schema.numParts'='4', 
  'spark.sql.sources.schema.part.0'='{...}', 
  'spark.sql.sources.schema.partCol.0'='tenant', 
  'spark.sql.sources.schema.partCol.1'='date')
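
As a debugging aid, it may help to compare what Trino resolves for this table with what Athena reports. A hedged sketch, reusing the catalog, schema, and table names from the queries above; the $timeline metadata table is an assumption about the Hudi connector in this Trino version:

-- What Trino thinks the table definition is, per catalog
SHOW CREATE TABLE awsdatacatalog.hudi_db.test_table_ro;
SHOW CREATE TABLE hudi.hudi_db.test_table_ro;

-- Hive connector: list the partitions Trino can see (projected or registered);
-- an empty result here would point at partition resolution rather than data reading
SELECT * FROM awsdatacatalog.hudi_db."test_table_ro$partitions";

-- Hudi connector: inspect the commit timeline, if the $timeline metadata table is available
SELECT * FROM hudi.hudi_db."test_table_ro$timeline";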
