Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Initial Java Support for GDS to KvikIO #396

Open
wants to merge 33 commits into
base: branch-24.12
Choose a base branch
from

Conversation

aslobodaNV
Copy link

@aslobodaNV aslobodaNV commented Jun 25, 2024

This PR is intended to add initial support for Java binding to GDS as part of the KvikIO library. In this PR are the minimal set of bindings required to support synchronous read and write IO operations via GDS as well as a single example to demonstrate how the bindings can be used alongside other CUDA libraries, such as JCuda. Full support for the GDS CuFile API, including batch and asynchronous IO, has not yet been implemented and more sophisticated error/exception handling is not yet in place. There is a README located within kvikio/java detailing how this new functionality can be compiled and built locally, along with detailed instructions on how to run the included usage example.

Copy link

copy-pr-bot bot commented Jun 25, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@aslobodaNV aslobodaNV marked this pull request as draft June 25, 2024 20:48
@madsbk madsbk added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Jun 26, 2024
java/README.md Outdated
Comment on lines 1 to 13
Java Bindings

Summary
These Java KvikIO bindings for GDS currently support only synchronous read and write IO operations using the underlying CuFile API. Support for batch IO and asynchronous operations are not yet supported.

Dependencies
The Java KvikIO bindings have been developed to work on Linux based systems and require CUDA to be installed and for GDS to be properly enabled. Instructions for how to install and enable GDS can be found on NVIDIA's website. To compile the shared library it is also necessary to have a JDK installed. To run the included example, it is also necessary to install JCuda as it is used to handle memory allocations and the transfer of data between host and GPU memory. JCuda jar files supporting CUDA 12.x can be found here:
https://repo1.maven.org/maven2/org/jcuda/jcuda/12.0.0/jcuda-12.0.0.jar
https://repo1.maven.org/maven2/org/jcuda/jcuda-natives/12.0.0/jcuda-natives-12.0.0.jar

Compilation
To recompile the .so file for your local system run the following command. Note: Update the command to reflect the directory where you have installed CUDA and your JDK.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use markdown formatting here, so that this will look a bit better when rendered.

Suggested change
Java Bindings
Summary
These Java KvikIO bindings for GDS currently support only synchronous read and write IO operations using the underlying CuFile API. Support for batch IO and asynchronous operations are not yet supported.
Dependencies
The Java KvikIO bindings have been developed to work on Linux based systems and require CUDA to be installed and for GDS to be properly enabled. Instructions for how to install and enable GDS can be found on NVIDIA's website. To compile the shared library it is also necessary to have a JDK installed. To run the included example, it is also necessary to install JCuda as it is used to handle memory allocations and the transfer of data between host and GPU memory. JCuda jar files supporting CUDA 12.x can be found here:
https://repo1.maven.org/maven2/org/jcuda/jcuda/12.0.0/jcuda-12.0.0.jar
https://repo1.maven.org/maven2/org/jcuda/jcuda-natives/12.0.0/jcuda-natives-12.0.0.jar
Compilation
To recompile the .so file for your local system run the following command. Note: Update the command to reflect the directory where you have installed CUDA and your JDK.
# Java Bindings
## Summary
These Java KvikIO bindings for GDS currently support only synchronous read and write IO operations using the underlying CuFile API. Support for batch IO and asynchronous operations are not yet supported.
## Dependencies
The Java KvikIO bindings have been developed to work on Linux based systems and require CUDA to be installed and for GDS to be properly enabled. Instructions for how to install and enable GDS can be found on NVIDIA's website. To compile the shared library it is also necessary to have a JDK installed. To run the included example, it is also necessary to install JCuda as it is used to handle memory allocations and the transfer of data between host and GPU memory. JCuda jar files supporting CUDA 12.x can be found here:
https://repo1.maven.org/maven2/org/jcuda/jcuda/12.0.0/jcuda-12.0.0.jar
https://repo1.maven.org/maven2/org/jcuda/jcuda-natives/12.0.0/jcuda-natives-12.0.0.jar
## Compilation
To recompile the .so file for your local system run the following command. Note: Update the command to reflect the directory where you have installed CUDA and your JDK.

(See https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax for further syntax).

java/README.md Outdated
Comment on lines 8 to 9
https://repo1.maven.org/maven2/org/jcuda/jcuda/12.0.0/jcuda-12.0.0.jar
https://repo1.maven.org/maven2/org/jcuda/jcuda-natives/12.0.0/jcuda-natives-12.0.0.jar
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Prefer not to link to explicit files since this can easily go out of date without us remembering to update things. For preference, can we link to an appropriate installation document for JCuda?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion, I can link to the installation instructions as well but found that the installation instructions themselves are a bit unclear/potentially out of date. I may keep these links but just add a note to confirm they are still up to date and reference the installation instructions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit confused by these bindings. They look to be bindings to the cufile API, and not the kvikio API. Is that deliberate? It seems like if so one will have to replicate a lot of the implementation that already exists in kvikio here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct, these are directly bound to the underlying cufile API. In the initial discussions with the rest of the GDS team there was some concern about potential performance overhead of Java bindings -> C++ bindings -> C API. I have not investigated the overhead yet, but definitely see the benefit you are suggesting to sharing one underlying set of bindings. If you don't mind, I will add an issue to my backlog to investigate the potential performance differences and, if they are minimal, update these to point to the C++ bindings at a later date.

Copy link
Member

@madsbk madsbk Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, let's continue with raw java bindings for now.

Side note, I would be very surprised if the basic C++ API introduce any significant overhead. E.g. the overhead of calling read() vs. calling cuFileRead() is tiny.

@madsbk
Copy link
Member

madsbk commented Jun 27, 2024

/ok to test

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few high level comments before digging too deep into this code:

  • I have a weak preference to wrap the C++ bindings rather than re-implement kvikio in Java. This will probably be easier than reimplementing the code as new features are introduced to kvikio, and gives more confidence in testing (bindings are probably less bug-prone than a full reimplementation).
  • This will need to pass style checks. See the cuDF guide on Code Formatting. The same guidance applies to using pre-commit with kvikio development.
  • This will need a build system, probably Maven. Long commands in the README that explain how to supply all the dependencies are not a good solution if those dependencies can be fetched and built automatically. See how cuDF's Java bindings use Maven, for instance. https://github.com/rapidsai/cudf/tree/branch-24.08/java
    • The end result should be that commands like mvn clean install work properly.
  • This will need CI builds and tests. Review the following resources on cuDF's Java CI and try to duplicate them:

Give this a start and please feel free to schedule time with me to discuss anything that is unclear!

@jakirkham
Copy link
Member

Is this still planned for 24.08 or should we move to 24.10?

@aslobodaNV
Copy link
Author

I will not likely finish the updates here for a couple weeks due to other priorities, so this should be moved to 24.10

@jakirkham jakirkham changed the base branch from branch-24.08 to branch-24.10 July 24, 2024 20:27
@jakirkham
Copy link
Member

Ok thanks Alex! 🙏

Have moved to 24.10

@aslobodaNV aslobodaNV marked this pull request as ready for review September 4, 2024 21:21
@aslobodaNV aslobodaNV requested review from a team as code owners September 4, 2024 21:21
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick first pass of review -- mostly focused on packaging and CI. Give my suggestions a try and I'll comment /ok to test to run your next commits on CI.

.github/workflows/build.yaml Outdated Show resolved Hide resolved
.github/workflows/pr.yaml Outdated Show resolved Hide resolved
.github/workflows/test.yaml Outdated Show resolved Hide resolved
java/README.md Outdated
#### Run example

cd kvikio/java/
java -cp target/cufile-24.10.0-SNAPSHOT.jar:$HOME/.m2/repository/org/jcuda/jcuda/12.0.0/jcuda-12.0.0.jar:$HOME/.m2/repository/org/jcuda/jcuda-natives/12.0.0/jcuda-natives-12.0.0.jar -Djava.library.path=./target bindings.kvikio.example.Main
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cufile-24.10.0-SNAPSHOT.jar

cuFile is not versioned like RAPIDS. Is this correct?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the version, the jar that is generated is not aware of what version of libcufile was used to generate it... that just depends on what version of the cuda toolkit the host machine has installed. It appears I can't make the jar not have a version number at all, so I'm inclined to have the version represent the version of kvikio. If you think there's a way I can inject the version of libcufile into maven at runtime I am open to that, but I have not found any information on how that would be done so far.

@bdice
Copy link
Contributor

bdice commented Oct 1, 2024

/ok to test

1 similar comment
@bdice
Copy link
Contributor

bdice commented Oct 1, 2024

/ok to test

java/pom.xml Outdated
<configuration>
<target>
<!-- Compile native code using nvcc -->
<exec executable="/usr/local/cuda/bin/nvcc">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aslobodaNV The first real failure: this path is not the correct path to nvcc in the CI environment, which installed cuda-nvcc with conda. Can you just call nvcc?

Suggested change
<exec executable="/usr/local/cuda/bin/nvcc">
<exec executable="nvcc">

Error log here:

Error:  Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:3.0.0:run (compile-native-code) on project cufile: An Ant BuildException has occured: Execute failed: java.io.IOException: Cannot run program "/usr/local/cuda/bin/nvcc" (in directory "/__w/kvikio/kvikio/java"): error=2, No such file or directory
Error:  around Ant part ...<exec executable="/usr/local/cuda/bin/nvcc">... @ 4:49 in /__w/kvikio/kvikio/java/target/antrun/build-main.xml

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, the include paths should be located in the conda environment. The CUDA packages use a special "targets" layout. For example, libcufile-dev will install cufile.h into ${CONDA_PREFIX}/targets/x86_64-linux/include/cufile.h.

The openjdk package installs some headers into ${CONDA_PREFIX}/include and some into ${CONDA_PREFIX}/include/linux.

CMake knows how to detect and parse this layout of the CUDA Toolkit but I do not know how Maven or other build systems are supposed to consume it. I would ask the Spark team, perhaps they know: @revans2, do you have insight here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally fine with committing the update to just call nvcc instead of the path, I just didn't happen to have nvcc on my path when I wrote that. Let me look into the include paths and reach out to Bobby if I get stuck.

- output_types: conda
packages:
- maven
- openjdk=8.*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to use openjdk=8.*? It appears that the latest is openjdk=22.*.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No particular reason, I just happened to have 8 installed and used that. I will update to 22

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JDK8 is a very much a least common denominator. It is no longer supported without paying a lot of money to oracle though.

https://www.oracle.com/java/technologies/java-se-support-roadmap.html

and oracle is trying very hard to push people away from older versions of java.

java 11, 17, and 21 are long term support versions that are currently GA.

Java generally has really good backwards compatibility so almost anything compiled with JDK8 can run on anything newer. But newer code is not backwards compatible with older runtimes. But you can ask newer compilers to output binary code that is compatible with older JDK versions. That is what we do with the Spark plugin.

We also support whatever Spark supports for their default binary release. So for older versions of Spark it is JDK8. For some of the latest versions that is JDK17, but because of how we ask the compiler to output older binary versions it really becomes a requirement that they have at least that version of java installed.

I wouldn't update to jdk22 unless you also decide which version of java you want to be minimally compatible with and ask the compiler to output compatible .class files.

Also just another point to think about jcuda still uses openjdk8. This is probably to have as much compatibility as possible. https://github.com/jcuda/jcuda-main/blob/master/BUILDING.md

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for that context, that makes a lot of sense. I have not been following the broader Java versioning ecosystem so was unaware of much of this. I think it is likely we will want to pick one of the versions with long term support, I would guess 17 or 21, but will think on it some more before making an update.

java/pom.xml Outdated
<configuration>
<target>
<!-- Compile native code using nvcc -->
<exec executable="/usr/local/cuda/bin/nvcc">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, the include paths should be located in the conda environment. The CUDA packages use a special "targets" layout. For example, libcufile-dev will install cufile.h into ${CONDA_PREFIX}/targets/x86_64-linux/include/cufile.h.

The openjdk package installs some headers into ${CONDA_PREFIX}/include and some into ${CONDA_PREFIX}/include/linux.

CMake knows how to detect and parse this layout of the CUDA Toolkit but I do not know how Maven or other build systems are supposed to consume it. I would ask the Spark team, perhaps they know: @revans2, do you have insight here?

Co-authored-by: Bradley Dice <[email protected]>
Copy link

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall it looks good. Just a few nits. Mostly around java convention. I am curious about your plans to distribute binaries. The java code itself is going to require the native library to work, but the native library is something that the user is expected to compile themselves. This is not a great user experience, but it is a real pain to try and distribute native binaries too. If you want/need some help with this there are some people on my team that have spent a lot of time to make this work well, especially for x86 linux.

java/pom.xml Show resolved Hide resolved
java/pom.xml Outdated Show resolved Hide resolved
java/src/main/java/bindings/kvikio/cufile/CuFile.java Outdated Show resolved Hide resolved
driver.close();
}));
initialized = true;
} catch (Throwable t) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: catching Throwable is not ideal. It can include all kinds of unrecoverable errors. At a minimum you probably want to print out why this could not be initialized in the error message. That way you can debug that a dependency was missing/etc.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you be able to point me to an example of best practices in this area? Happy to change it further from my latest update.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I marked it as a nit because it is probably fine in this case. In fact when I went back to look at the CUDF code, I did the exact same thing 5 years ago.

https://github.com/rapidsai/cudf/blob/4dbb8a354a9d4f0b4d82a5bf9747409c6304358f/java/src/main/java/ai/rapids/cudf/NativeDepsLoader.java#L74-L83

It is just generally not good practice to catch a Throwable, because Throwable includes all Errors and Errors are generally considered to not be recoverable. But really the main thing here is giving the user enough information that they can debug why it is failing. You are printing the error message, but the full stack trace would be better. That is the main thing that I would suggest you do.


package bindings.kvikio.cufile;

abstract class CuFileHandle implements AutoCloseable {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be nice to have some javadocs here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will be sure to get these javadocs that you have requested in before we merge, just wanted to tackle the build issues we were seeing first. Thank you for all your feedback.


package bindings.kvikio.cufile;

public final class CuFileReadHandle extends CuFileHandle {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

javadocs would be very nice for this class as it is intended to be used by developers.


package bindings.kvikio.cufile;

public final class CuFileWriteHandle extends CuFileHandle {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here javadocs would be very nice for the developer experience.

@revans2
Copy link

revans2 commented Oct 7, 2024

I forgot to add that the APIs to take a raw pointer is far from ideal. It might be nice to have a separate library that is compiled against jcuda and the code kvikio jar so that you can hide the scary part of pulling the native address out of a Pointer and provide a cleaner/safer API.

@aslobodaNV
Copy link
Author

@madsbk or @bdice would one of you be able to try and run this through the build tests? I have updated it to try and make it more agnostic to the build location. It works on my machine, but need to know if there are more build errors remaining to be fixed for the CI/CD machines.

Copy link

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally it looks really good. I am not going to approve it because I don't know enough about cufile to be able to say that this matches all of the requirements. But from a java perspective it looks decent.

- output_types: conda
packages:
- maven
- openjdk=8.*
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JDK8 is a very much a least common denominator. It is no longer supported without paying a lot of money to oracle though.

https://www.oracle.com/java/technologies/java-se-support-roadmap.html

and oracle is trying very hard to push people away from older versions of java.

java 11, 17, and 21 are long term support versions that are currently GA.

Java generally has really good backwards compatibility so almost anything compiled with JDK8 can run on anything newer. But newer code is not backwards compatible with older runtimes. But you can ask newer compilers to output binary code that is compatible with older JDK versions. That is what we do with the Spark plugin.

We also support whatever Spark supports for their default binary release. So for older versions of Spark it is JDK8. For some of the latest versions that is JDK17, but because of how we ask the compiler to output older binary versions it really becomes a requirement that they have at least that version of java installed.

I wouldn't update to jdk22 unless you also decide which version of java you want to be minimally compatible with and ask the compiler to output compatible .class files.

Also just another point to think about jcuda still uses openjdk8. This is probably to have as much compatibility as possible. https://github.com/jcuda/jcuda-main/blob/master/BUILDING.md

java/pom.xml Outdated
<version>3.8.0</version>
<configuration>
<source>21</source>
<target>21</target>
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where you can set what the output target is that you want.

@@ -0,0 +1,21 @@
cmake_minimum_required(VERSION 3.10)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RAPIDS requires a minimum of cmake 3.26.4 right now, I think. Can we set a higher minimum for these bindings? I don't think we will ever test with a CMake version as old as 3.10.

@bdice
Copy link
Contributor

bdice commented Oct 18, 2024

/ok to test

@bdice
Copy link
Contributor

bdice commented Oct 18, 2024

/ok to test

@bdice
Copy link
Contributor

bdice commented Oct 18, 2024

/ok to test

@bdice
Copy link
Contributor

bdice commented Oct 18, 2024

/ok to test

Copy link

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the java side of things it looks good. I am not an expert on the APIs or all of the use cases, but the plumbing looks decent.

@bdice
Copy link
Contributor

bdice commented Oct 22, 2024

@aslobodaNV Looks like there are CI build failures. I'm not sure how to diagnose this.

Error:  Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.8.0:compile (default-compile) on project cufile: Fatal error compiling: error: invalid target release: 21 -> [Help 1]

java/pom.xml Outdated Show resolved Hide resolved
@bdice
Copy link
Contributor

bdice commented Oct 22, 2024

/ok to test

@bdice
Copy link
Contributor

bdice commented Nov 13, 2024

/ok to test

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improves an existing functionality non-breaking Introduces a non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants