-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alluxio cache #18719
Alluxio cache #18719
Conversation
lib/trino-hdfs/src/main/java/io/trino/filesystem/hdfs/cache/CachingFileSystemConfig.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/execution/scheduler/NodeSchedulerConfig.java
Outdated
Show resolved
Hide resolved
lib/trino-hdfs/src/main/java/io/trino/filesystem/hdfs/CachingHdfsInputFile.java
Outdated
Show resolved
Hide resolved
lib/trino-hdfs/src/main/java/io/trino/filesystem/hdfs/cache/CachingFileSystemConfig.java
Outdated
Show resolved
Hide resolved
lib/trino-hdfs/src/main/java/io/trino/filesystem/hdfs/cache/CachingFileSystemConfig.java
Outdated
Show resolved
Hide resolved
3325fd5
to
659298d
Compare
would also love to see this or 16375 merged. |
lib/trino-hdfs/src/main/java/io/trino/filesystem/hdfs/CachingHdfsInputFile.java
Outdated
Show resolved
Hide resolved
lib/trino-hdfs/src/main/java/io/trino/hdfs/util/ConsistentHashingNodeProvider.java
Outdated
Show resolved
Hide resolved
String cacheIdentifier = hashFunction.hashString(path + lastModifiedTime, UTF_8).toString(); | ||
URIStatus uriStatus = new URIStatus(info, CacheContext.defaults().setCacheIdentifier(cacheIdentifier)); | ||
return new FSDataInputStream(new HdfsFileInputStream( | ||
new LocalCacheFileInStream(uriStatus, (uri) -> new AlluxioHdfsInputStream(fileSystem.open(file)), cacheManager, alluxioConf), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we use soft scheduling, this might end up unnecessarily caching data on nodes whenever soft scheduling fails. That is, if a piece of data A is supposed to be cached on node 1, but node 1 is too busy, it might be scheduled to run on node 2. Then node 2 will read A from cloud storage, and add it to its cache. However, the next time data A is supposed to be read, and node 1 is still too busy, we are just as likely to schedule the read to happen on node 3. So we will just end up consuming write capacity and cache capacity from node 2, without really getting any benefit of caching. This will especially be a problem in a cluster with a lot of nodes.
The nodes need a way to know if they are supposed to cache a piece of data or not, and fall back to not caching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That might be possible to mitigate by populating a list of preferred host addresses in the connector split instead of just one host
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this potentially reduce the cache hit rate? For instance, let's say we have 100 nodes, each with a 1 TB cache, a 100 TB of data we want to query, and all data is only cached on a single node. If soft scheduling never fails we will eventually have a 100% cache hit rate. If soft scheduling fails 10% of the time, 10% of the data will be scheduled on a random node not supposed to cache this data, and store "garbage" in its cache. Eventually 10% of all cached data is not cached on a node with the correct cache key. This should give roughly an 80% cache hit rate.
If on the other hand we cache all data on two nodes, then only 50 TB of the data we want to query can be cached. This should give a cache hit rate of 50%, even if soft scheduling never fails.
If nodes only cache data they are supposed to cache, we would get an 90% cache hit rate, if soft scheduling fails 10% of the time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is a problem as long as you deterministically provide the same list of preferred host addressees for a given split and the scheduler attempts to schedule splits on these hosts in the provided preferred order. I'm assuming that the probability that all the preferred hosts are too busy to scheduled splits is low with 3 preferred hosts. You might cache some data in multiple nodes but that would be useful when the primary preferred node is too busy.
A couple of alternatives to this are:
- The worker node checks that it is not the preferred host from the split and uses the non-cached file system implementation when it's not. This would give up on caching if the preferred host is too busy to schedule splits on.
- The embedded cache implementation itself has the the ability to remotely read cached data from any worker. This way even if a split gets scheduled on a node which didn't cache the data, it can still read the cached data from another node. This is the approach that Rubix takes. However, this is probably not worth the added complexity.
lib/trino-hdfs/src/main/java/io/trino/hdfs/util/NodeProvider.java
Outdated
Show resolved
Hide resolved
lib/trino-hdfs/src/main/java/io/trino/hdfs/util/NodeProvider.java
Outdated
Show resolved
Hide resolved
.setPath(path) | ||
.setFolder(false) | ||
.setLength(length()); | ||
String cacheIdentifier = hashFunction.hashString(path + lastModifiedTime, UTF_8).toString(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit unfortunate that we have to do the caching on the file level, and not on the part-of-file level. With the current caching scheme a single large file from cloud storage will be cached on a single node. For a large and hot file this could potentially lead to problems. But it might be hard to solve, and might not be worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@beinan could you please weigh in on whether this is possible to improve ?
An update on this here - based on the feedback above, we're reworking the PR to:
We're planning to polish it a bit internally, then bring it up here for discussion probably next week. |
Amazing thank you for this work. |
659298d
to
d5118f6
Compare
Hey, a small update here. We have just pushed our latest changes with the promised refactor. There is likely still some work to do, especially around testing. Still, we hope this code structure can be a good point for continuing with the review. To summarize some of the changes and open questions:
If anyone wants to test this PR with other connectors, please be aware that the |
d5118f6
to
5684fe2
Compare
Thanks for your work on this. Can you move Is the Alluxio cache code fundamentally tied to Hadoop? I see |
5684fe2
to
8132290
Compare
I've moved it to a new module |
@jkylling can you rebase to resolve conflict? |
Yes, I'm looking at it. It's a bit tricky as the alluxio-shaded client contains a lot of classes which can overlap with other dependencies, and rebasing on master have brought about a new set of duplicate resources. I'm trying to switch to use non-shaded Alluxio libraries instead. |
@jkylling we should use shaded version if possible |
555645a
to
d66252e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can squash last two commits
@Override | ||
public Optional<String> getCacheKey(TrinoInputFile delegate) | ||
{ | ||
// TODO: Enable caching of parquet checkpoint files once the CheckpointEntryIterator is always closed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the TODO "CheckpointEntryIterator is always closed" still pending ? If it is, please include a link to an open GH issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be resolved by #20054 now. I'll enable the caching of checkpoint files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make sure we include some test with checkpoint files in delta if we're doing that in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TestDeltaLakeAlluxioCacheFileOperations
does have coverage for the cache operations for checkpoint files, but it's unclear how much coverage on checkpoints we get through the TestDeltaLakeAlluxioCacheMinioAndHmsConnectorSmokeTest
test. It might be prudent to just leave it as is, and consider it as a future optimization. I'll keep the code as is, and leave a TODO to enable caching of checkpoint files.
Enabling caching of checkpoint files could make the delta.checkpoint-filtering.enabled=true
feature work better. When we tried enabling this feature earlier we saw a major slow down of the planning and analysis phases. The in-memory cache of checkpoints seemed to no longer be used, and lots of queries were made by the coordinator to object storage to fetch checkpoint files. However, if we only need to read the checkpoint files from disk it could work without the in-memory checkpoint cache.
...n/java/io/trino/tests/product/launcher/env/environment/EnvMultinodeMinioDataLakeCaching.java
Outdated
Show resolved
Hide resolved
...roduct-tests/src/main/java/io/trino/tests/product/deltalake/TestDeltaLakeAlluxioCaching.java
Outdated
Show resolved
Hide resolved
Ran SF1000 TPC benchmark on 6 r6gd.8xlarge workers Results on TPC look pretty good, there is significant reduction in wall time and some CPU time reduction. As a follow-up we should look at exposing bytes read from cache as a connector metric, this will make it easy to see usage of the cache for each table scan in a query in output of EXPLAIN ANALYZE VERBOSE, queryinfo json, event listener metrics etc. |
d66252e
to
fee6f07
Compare
I'm going to merge Rubix drop in a while (#20102), so we can rebase and drop the conflict resolution commit from this PR. |
db4944b
to
51bff46
Compare
@jkylling please rephrase Rubix commit. We've decided internally to remove Rubix when we implement Hive caching with Alluxio. We will merge this PR and add Hive support in a separate change |
Co-authored-by: Florent Delannoy <[email protected]>
Co-authored-by: Florent Delannoy <[email protected]>
51bff46
to
1b79597
Compare
Co-authored-by: Florent Delannoy <[email protected]>
Co-authored-by: Florent Delannoy <[email protected]>
Co-authored-by: Florent Delannoy <[email protected]>
1b79597
to
97e8df7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comments, otherwise looks good
<groupId>org.alluxio</groupId> | ||
<artifactId>alluxio-core-common</artifactId> | ||
<version>${dep.alluxio.version}</version> | ||
<exclusions> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we file an issue to fix these in Alluxio?
DistributedQueryRunner queryRunner = DistributedQueryRunner.builder(session) | ||
.build(); | ||
try { | ||
File metastoreDirectory = queryRunner.getCoordinator().getBaseDataDir().resolve("delta_lake_metastore").toFile().getAbsoluteFile(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're trying to migrate the testing code so that
- we create connectors using properties rather than hand wiring
- we use OpenTelemetry tracing rather than custom tracking code
For example, 458bfd7 replaced a custom metastore wrapper with a getSpans()
method on DistributedQueryrunner
(take a look at the assertMetastoreInvocationsForQuery()
utility method). If we can do a similar thing here, the test construction becomes simpler and easier to maintain, and we know that we're testing the same code that runs in production.
I'd like to remove TrackingFileSystemFactory
, so it's best if we don't introduce more usages of it.
@@ -249,12 +249,24 @@ | |||
<scope>runtime</scope> | |||
</dependency> | |||
|
|||
<dependency> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can convert TestDeltaLakeAlluxioCacheFileOperations
to construct the connector with properties and use tracing (per my other comment), then we should be able to remove these runtime dependencies.
|
||
import java.util.List; | ||
|
||
public class NoneCachingHostAddressProvider |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think NoCachingHostAddressProvider
would sound better
@@ -91,6 +94,8 @@ protected void setup(Binder binder) | |||
install(new GcsFileSystemModule()); | |||
factories.addBinding("gs").to(GcsFileSystemFactory.class); | |||
} | |||
|
|||
newOptionalBinder(binder, CachingHostAddressProvider.class).setDefault().to(NoneCachingHostAddressProvider.class).in(Scopes.SINGLETON); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than using an optional binder here, the configured cache implementation should bind the implementation. So the NONE
cache should install NoCachingHostAddressProvider
and ALLUXIO
should install ConsistentHashingHostAddressProvider
.
public Optional<String> getCacheKey(TrinoInputFile delegate) | ||
throws IOException | ||
{ | ||
return Optional.of(delegate.location().path() + delegate.lastModified()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should use a separator that won't appear in file names. Otherwise, we could have a collision with a filename ending in a number.
public Optional<String> getCacheKey(TrinoInputFile delegate) | ||
{ | ||
// TODO: Consider caching of the Parquet checkpoint files within _delta_log | ||
if (!delegate.location().path().contains("/_delta_log/")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we skip caching this directory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The _delta_log
directory contains the files _last_checkpoint
and _trino_meta/extended_stats.json
. These are not immutable, so are tricky to cache. Also the commit files of the form 000...123.json
might not be immutable on ABFS. The checkpoint files should be immutable when accessed by Trino. We decided to leave it as a future optimization in #18719 (comment)
@colebow .. release notes entry should link to the docs and maybe just say Add support for filesystem caching Same for the other incoming PRs for Iceberg, Hive, and Hudi |
Description
👋
This PR includes Alluxio caching into Trino. It is a reworking of #16375 with:
optimized-local-scheduling
rather than introducing a new concept of node affinitylastModifiedTime
from the coordinator to the workers via ConnectorSplit to allow for immutable caching without workers having to maintain their own file status cacheTestDeltaLakeMinioAndHmsConnectorSmokeTest
intoTestCachingDeltaLakeMinioAndHmsConnectorSmokeTest
Additional context and related issues
This is very much #16375 from @beinan reworked after helpful feedback from @raunaqmorarka (thank you! 🙏 ).
I'm putting it out there so we can discuss next steps on integrating this to Trino 🥳 with the following notes:
optimized-local-scheduling
means each connector has to implement split scheduling by specifying anaddress
on the splits it generates. Otherwise, each split will randomly be assigned to a worker node, and the cache won't be distributed.lastModifiedTime
from the underlying file system, and use the cached file directly if availabletrino-delta-lake
, the connector I'm developing this for, but will need to be implemented separately for other connectors to take full advantage of AlluxioRelease notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: