Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for HTTP transport in thrift metastore client #17925

Closed
wants to merge 3 commits into from

Conversation

vihangk1
Copy link
Member

@vihangk1 vihangk1 commented Jun 15, 2023

Description

This PR adds support for using HTTP as the transport for the thrift metastore client. In this mode the hive.metastore.uri can support a http(s) URL to the metastore service. Since each thrift API is a POST request to the the metastore endpoint, this PR also adds support for adding a HTTP bearer token which must be configured to authenticate the metastore client to the metastore server. The token can be configured using a configuration hive.metastore.http.client.bearer-token. To support Databricks Unity Catalog's HMS API interface, the configuration hive.metastore.http.client.additional-headers must be set to X-Databricks-Unity-Catalog-Name=<catalog_name> where the catalog_name is the name of the catalog object in Databricks Unity Catalog.

Additional context and related issues

Thrift has support HTTP based transport for a long time. Ref https://github.com/apache/thrift/blob/master/lib/javame/src/org/apache/thrift/transport/THttpClient.java. We have used the http mode for thrift on various services in other projects like HiveServer2 in Apache Hive and Apache Spark for a while now.

More recently, Apache Hive also added support for Hive Metastore server to use HTTP mode.
https://issues.apache.org/jira/browse/HIVE-21456

This PR modifies the logic in DefaultThriftMetastoreClientFactory to instantiate THttpClient for the transport when creating the thrift metastore client.

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

Add HTTP support for the thrift metastore client.

# Section
* Fix some things. ({issue}17865)

Fixes #17865

@cla-bot
Copy link

cla-bot bot commented Jun 15, 2023

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@github-actions github-actions bot added hive Hive connector tests:hive labels Jun 15, 2023
@vihangk1 vihangk1 closed this Jun 15, 2023
@vihangk1 vihangk1 reopened this Jun 15, 2023
@vihangk1 vihangk1 force-pushed the master_thrift-http branch from 1d19c77 to 47ca558 Compare June 15, 2023 21:32
@cla-bot
Copy link

cla-bot bot commented Jun 15, 2023

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@martint
Copy link
Member

martint commented Jun 20, 2023

@cla-bot check

@cla-bot cla-bot bot added the cla-signed label Jun 20, 2023
@cla-bot
Copy link

cla-bot bot commented Jun 20, 2023

The cla-bot has been summoned, and re-checked this pull request!

@vihangk1 vihangk1 self-assigned this Jun 20, 2023
@cla-bot
Copy link

cla-bot bot commented Jun 28, 2023

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Vihang Karajgaonkar.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@cla-bot cla-bot bot removed the cla-signed label Jun 28, 2023
@vihangk1 vihangk1 force-pushed the master_thrift-http branch from 75f7992 to 556adc5 Compare June 28, 2023 01:15
@cla-bot cla-bot bot added the cla-signed label Jun 28, 2023
@vihangk1 vihangk1 force-pushed the master_thrift-http branch 2 times, most recently from f44a1bf to 40eb59b Compare June 29, 2023 02:40
@github-actions github-actions bot added the hudi Hudi connector label Jun 29, 2023
@vihangk1
Copy link
Member Author

Hi @electrum Would you like to take another look? I think I addressed all your review comments.

@electrum
Copy link
Member

electrum commented Jul 5, 2023

Can you squash all of the commits?

@vihangk1 vihangk1 force-pushed the master_thrift-http branch from f7d82b7 to 1174392 Compare July 6, 2023 17:53
@vihangk1
Copy link
Member Author

vihangk1 commented Jul 6, 2023

@electrum Squashed all commits into one and force pushed.

Copy link
Member

@electrum electrum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating this. Looking at the config closer, rather than adding hive.metastore.thrift.transport-mode, we should use URI from hive.metastore.uri to determine the protocol. That's actually why we used a URI in the first place, rather than just a host/port. This will also allow mixing raw and HTTP endpoints.

@vihangk1
Copy link
Member Author

I think that makes sense. Let me update the PR to derive the HTTP mode directly from the URI scheme.

@vihangk1
Copy link
Member Author

vihangk1 commented Jul 10, 2023

Thanks for updating this. Looking at the config closer, rather than adding hive.metastore.thrift.transport-mode, we should use URI from hive.metastore.uri to determine the protocol. That's actually why we used a URI in the first place, rather than just a host/port. This will also allow mixing raw and HTTP endpoints.

@electrum Ah, I remember now why I had to introduce a http mode config. The interface to create the metastore client takes a HostAndPort unfortunately here

ThriftMetastoreClient create(HostAndPort address, Optional<String> delegationToken)

Due to this we don't have the URI scheme to decide if we should use a http client or a binary client. I can change the interface to include uri but then the changes trickle down to other files as well and the scope of the PR increases. I am okay to do that but just wanted to make sure that I am understanding this right before getting into the changes.

@vihangk1
Copy link
Member Author

I updated the PR to reuse the metastore URI scheme to determine if we should use the http thrift transport. I added a separate commit so that we can see what all changes are needed just to reuse the scheme from the metastore uri directly.

@github-actions github-actions bot added iceberg Iceberg connector delta-lake Delta Lake connector labels Jul 13, 2023
@ebyhr ebyhr force-pushed the master_thrift-http branch from 191c15e to 9302b9a Compare October 27, 2023 07:23
@ebyhr
Copy link
Member

ebyhr commented Oct 27, 2023

@vihangk1
Copy link
Member Author

https://github.com/trinodb/trino/actions/runs/6664185107/job/18111804372?pr=19085

I think the product test in the link above is failing due to some infrastructure issue. It is failing because it cannot find the jdbc driver to connect to the databricks cluster. The pt works for me when I run locally on my environment.

@vihangk1 vihangk1 force-pushed the master_thrift-http branch 2 times, most recently from 9d60718 to 69a386e Compare October 29, 2023 06:02
@ebyhr
Copy link
Member

ebyhr commented Oct 30, 2023

@vihangk1
Copy link
Member Author

https://github.com/trinodb/trino/actions/runs/6688269641/job/18170433430?pr=19085

I saw the error. those tests are failing due to cluster not being found (see below). Can we check if the databricks cluster terminated during that time?

RESOURCE_DOES_NOT_EXIST: No cluster found matching: 0907-043916-ahg9aqt3

@ebyhr ebyhr force-pushed the master_thrift-http branch from 9bc89ec to 64f1e65 Compare November 2, 2023 08:35
plugin/trino-hive/pom.xml Outdated Show resolved Hide resolved
ThriftMetastoreConfig metastoreConfig = buildConfigObject(ThriftMetastoreConfig.class);
ThriftHttpMetastoreConfig httpMetastoreConfig = buildConfigObject(ThriftHttpMetastoreConfig.class);

if (!staticMetastoreConfig.getMetastoreUris().isEmpty()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When does getMetastoreUris return empty? It would be nice to deny if it's unexpected:

checkArgument(!staticMetastoreConfig.getMetastoreUris().isEmpty(), "...");

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I remember correctly, I think there were some tests which were failing which caused me add this line. But I cannot find which ones were those. I have added this precondition check for now to see if the CI reveals those failing tests again.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I add a checkArgument condition here, I see test failures for TestHiveConnectoryFactory which expect a different error message asserted in StaticTokenAwareMetastoreClientFactory

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the precondition check to checkNotNull for the metastoreUris. If the metastoreUris is empty the validateThriftHttpConfiguration becomes no-op and the higher layers like StaticTokenAwareMetastoreClientFactory will assert the metastoreUris are non-empty.

@vihangk1 vihangk1 force-pushed the master_thrift-http branch 2 times, most recently from 502e7c8 to 9ffa0b3 Compare November 7, 2023 06:46
@vihangk1 vihangk1 force-pushed the master_thrift-http branch 2 times, most recently from 28eea16 to 8ce2605 Compare November 9, 2023 18:36
@vihangk1
Copy link
Member Author

vihangk1 commented Nov 9, 2023

rebased and resolved conflicts.

@vihangk1 vihangk1 force-pushed the master_thrift-http branch 2 times, most recently from 558dbfe to 7952c33 Compare November 10, 2023 23:04
@vihangk1
Copy link
Member Author

The CI failure looks unrelated. It failed due to "Terminating due to java.lang.OutOfMemoryError: Java heap space"

@ebyhr ebyhr requested a review from electrum November 15, 2023 01:40
@shashanksinghfd
Copy link

Thanks @vihangk1 for keeping the PR alive even after 5 months of to and fro , in true spirit of open source :)
thanks maintainers too, hope this gets merged .

@sajjoseph
Copy link
Contributor

Any updates? It will be great if we can get this merged so that Trino can integrate Unity catalog.

@findinpath
Copy link
Contributor

Adding here a snippet from a Slack conversation with @vihangk1

public class ThriftMetastoreModule
        extends AbstractConfigurationAwareModule
{
    @Override
    protected void setup(Binder binder)
    {



        binder.bind(TokenAwareMetastoreClientFactory.class).to(StaticTokenAwareMetastoreClientFactory.class).in(Scopes.SINGLETON);
        configBinder(binder).bindConfig(StaticMetastoreConfig.class);



        StaticMetastoreConfig staticMetastoreConfig = buildConfigObject(StaticMetastoreConfig.class);
        boolean hasHttpMetastore = staticMetastoreConfig.getMetastoreUris().stream()
                .anyMatch(uri -> "http".equalsIgnoreCase(uri.getScheme()));
        boolean hasHttpsMetastore = staticMetastoreConfig.getMetastoreUris().stream()
                .anyMatch(uri -> "https".equalsIgnoreCase(uri.getScheme()));

        if (hasHttpMetastore || hasHttpsMetastore) {
            configBinder(binder).bindConfig(ThriftHttpMetastoreConfig.class);
            // bind the factory
        } else {
            configBinder(binder).bindConfig(ThriftMetastoreConfig.class);
            OptionalBinder.newOptionalBinder(binder, ThriftMetastoreClientFactory.class)
                    .setDefault().to(DefaultThriftMetastoreClientFactory.class).in(Scopes.SINGLETON);

            // ....
        }

I was suggesting him to separate what is related to thrift http configuration in a separate class which would be bound only if we're dealing with a http metastore.

@vihangk1 do you have any free cycles in the near future to continue with this PR?

@vihangk1
Copy link
Member Author

Adding here a snippet from a Slack conversation with @vihangk1

public class ThriftMetastoreModule
        extends AbstractConfigurationAwareModule
{
    @Override
    protected void setup(Binder binder)
    {



        binder.bind(TokenAwareMetastoreClientFactory.class).to(StaticTokenAwareMetastoreClientFactory.class).in(Scopes.SINGLETON);
        configBinder(binder).bindConfig(StaticMetastoreConfig.class);



        StaticMetastoreConfig staticMetastoreConfig = buildConfigObject(StaticMetastoreConfig.class);
        boolean hasHttpMetastore = staticMetastoreConfig.getMetastoreUris().stream()
                .anyMatch(uri -> "http".equalsIgnoreCase(uri.getScheme()));
        boolean hasHttpsMetastore = staticMetastoreConfig.getMetastoreUris().stream()
                .anyMatch(uri -> "https".equalsIgnoreCase(uri.getScheme()));

        if (hasHttpMetastore || hasHttpsMetastore) {
            configBinder(binder).bindConfig(ThriftHttpMetastoreConfig.class);
            // bind the factory
        } else {
            configBinder(binder).bindConfig(ThriftMetastoreConfig.class);
            OptionalBinder.newOptionalBinder(binder, ThriftMetastoreClientFactory.class)
                    .setDefault().to(DefaultThriftMetastoreClientFactory.class).in(Scopes.SINGLETON);

            // ....
        }

I was suggesting him to separate what is related to thrift http configuration in a separate class which would be bound only if we're dealing with a http metastore.

@vihangk1 do you have any free cycles in the near future to continue with this PR?

Thanks @sajjoseph and @findinpath for your comments. I could not find enough time to investigate the approach which @findinpath suggested above in the last month. I made some progress on that and I just need to address some test failures after my changes. I will send out a PR update by this weekend with the suggested approach.

@vihangk1
Copy link
Member Author

vihangk1 commented Feb 5, 2024

Closing this PR in favor of #20371

@vihangk1 vihangk1 closed this Feb 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed delta-lake Delta Lake connector docs hive Hive connector hudi Hudi connector iceberg Iceberg connector
Development

Successfully merging this pull request may close these issues.

Add support for HTTP transport to the thrift hive metastore client