
Let embedding model run on GPU #71

Open
ThiloteE opened this issue Jul 5, 2024 · 8 comments
Labels
important-decision (This issue contains an important architectural decision) · status: freeze (Issues postponed to a (much) later future) · type: enhancement (New feature or request)
Milestone

Comments

ThiloteE (Collaborator) commented Jul 5, 2024

Historical "what the fuck" is available at JabRef#11430 (comment)

[screenshot]

Advantages:

  • For LLMs, a GPU is much faster than a CPU: often 10x or more, depending on the hardware.

Disadvantages:

  • Please correct me if I am wrong, but I expect additional dependencies are required for a GPU backend (e.g. llama.cpp, Nvidia drivers and CUDA toolkit libraries, Vulkan, ROCm, SYCL, ...).
ThiloteE added the type: enhancement and important-decision labels on Jul 5, 2024
ThiloteE (Collaborator, Author) commented Jul 5, 2024

If implemented, let users choose the backend and the hardware (CPU vs. GPU, and which GPU: GPU1, GPU2, GPU3, ...) in the preferences.
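
A rough sketch of how such a preference could be modeled is below. This is purely hypothetical: none of these types exist in JabRef, and all names are made up for illustration.

```java
// Hypothetical sketch only: these types do not exist in JabRef yet.
// They illustrate one way to let users pick the backend and the device.
public record EmbeddingHardwarePreference(Backend backend, int gpuIndex) {

    public enum Backend { CPU, CUDA, VULKAN, ROCM }

    // Default preference: run on the CPU, no GPU selected.
    public static EmbeddingHardwarePreference cpuDefault() {
        return new EmbeddingHardwarePreference(Backend.CPU, -1);
    }
}
```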

InAnYan (Owner) commented Jul 17, 2024

Currently, in-process embedding models in langchain4j (meaning models that run locally on the computer) run only on the CPU. There is an open langchain4j issue about running embedding models on the GPU, but it is not resolved yet.
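
For reference, current in-process usage looks roughly like this minimal sketch (CPU only). It assumes the langchain4j-embeddings-all-minilm-l6-v2 dependency; the exact package of the model class varies between langchain4j versions.

```java
// Minimal sketch of the current, CPU-only in-process embedding in langchain4j.
// Assumes the langchain4j-embeddings-all-minilm-l6-v2 artifact; the package of
// AllMiniLmL6V2EmbeddingModel differs between langchain4j versions.
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.model.embedding.AllMiniLmL6V2EmbeddingModel;
import dev.langchain4j.model.output.Response;

public class InProcessEmbeddingCpuExample {
    public static void main(String[] args) {
        // The bundled ONNX model is executed in-process; there is currently
        // no option to move this work onto a GPU.
        AllMiniLmL6V2EmbeddingModel model = new AllMiniLmL6V2EmbeddingModel();
        Response<Embedding> response = model.embed("JabRef is a reference manager.");
        System.out.println("Embedding dimension: " + response.content().dimension());
    }
}
```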

In order to implement this, we have these choices:

  1. Wait for the implementation in langchain4j: simpler to develop and better from an architectural point of view.
  2. Write the fix for langchain4j ourselves: good.
  3. Use external modules and write all the support code ourselves in JabRef: the fastest way.

It's a very good idea and we should look into it, but probably a bit later, once we finally release the AI chat and maybe add summarization.

I'll mark the issue as low-priority, but it's only low priority in this context: week 1 and the first release.

InAnYan added and then removed the Priority: low (This issue is not very important (right now)) label on Jul 17, 2024
InAnYan (Owner) commented Jul 17, 2024

Actually, no. I'll remove the low-priority label and won't assign a milestone.

ThiloteE added the status: freeze (Issues postponed to a (much) later future) label on Jul 17, 2024
koppor added this to the Week 12 milestone on Aug 2, 2024
koppor (Collaborator) commented Aug 2, 2024

I am collecting it in the final "anything else" milestone, "final polishing" 😅

ThiloteE (Collaborator, Author) commented Aug 2, 2024

ThiloteE (Collaborator, Author) commented Aug 7, 2024

GPU support with the Deep Java Library (DJL): https://docs.djl.ai/engines/onnxruntime/onnxruntime-engine/index.html#install-gpu-package. Unfortunately, DJL also relies on Microsoft's ONNX Runtime, which seems to be very slow. I assume models would also need to be ONNX-compatible, which is a problem because not many models on Hugging Face are uploaded in the ONNX file format.
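
If we went the DJL route, a GPU-backed embedding could look roughly like the sketch below. This is untested: it assumes the onnxruntime-engine artifact with the GPU-enabled com.microsoft.onnxruntime:onnxruntime_gpu dependency (as described in the linked install docs) plus the DJL Hugging Face tokenizers extension, and the model URL and class names follow the DJL examples.

```java
// Untested sketch based on the DJL documentation. Assumes:
//  - ai.djl.onnxruntime:onnxruntime-engine with the GPU-enabled
//    com.microsoft.onnxruntime:onnxruntime_gpu swapped in for the CPU-only artifact,
//  - the DJL Hugging Face tokenizers extension for TextEmbeddingTranslatorFactory.
import ai.djl.Device;
import ai.djl.huggingface.translator.TextEmbeddingTranslatorFactory;
import ai.djl.inference.Predictor;
import ai.djl.repository.zoo.Criteria;
import ai.djl.repository.zoo.ZooModel;

public class DjlGpuEmbeddingSketch {
    public static void main(String[] args) throws Exception {
        Criteria<String, float[]> criteria = Criteria.builder()
                .setTypes(String.class, float[].class)
                // Example ONNX model from the DJL Hugging Face model zoo.
                .optModelUrls("djl://ai.djl.huggingface.onnxruntime/sentence-transformers/all-MiniLM-L6-v2")
                .optEngine("OnnxRuntime")
                // The relevant part: request the GPU instead of the default CPU.
                .optDevice(Device.gpu())
                .optTranslatorFactory(new TextEmbeddingTranslatorFactory())
                .build();

        try (ZooModel<String, float[]> model = criteria.loadModel();
             Predictor<String, float[]> predictor = model.newPredictor()) {
            float[] embedding = predictor.predict("JabRef is a reference manager.");
            System.out.println("Embedding dimension: " + embedding.length);
        }
    }
}
```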

koppor (Collaborator) commented Aug 29, 2024

At least one can paint everything blue in the CPU utilization graph:

[screenshot: CPU utilization graph]

ThiloteE (Collaborator, Author) commented Oct 1, 2024

One solution for providing GPU acceleration for LLMs (NOT necessarily for embedding models!) is to provide proper support for the OpenAI API; see issue JabRef#11872. By using external applications such as llama.cpp, GPT4All, LM Studio, Ollama, Jan, KoboldCpp, etc., which already provide GPU acceleration, there is no need to add and maintain this feature in JabRef itself. It would still be nice to have GPU acceleration for embedding models, though. Maybe do it like KoboldCpp and only provide a Vulkan backend, which is much, much smaller than a CUDA backend (~1.5 GB in PyTorch; 200-500 MB in llama.cpp).

[screenshot]
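
To illustrate the OpenAI-API route: once the base URL is configurable, langchain4j can talk to any local, GPU-accelerated OpenAI-compatible server. A minimal sketch follows; the URL, port, model name, and API key are placeholders, and the builder methods match the pre-1.0 langchain4j OpenAI module (they may differ in newer versions).

```java
// Sketch: delegate GPU work to a local OpenAI-compatible server
// (llama.cpp server, Ollama, GPT4All, KoboldCpp, ...). Base URL, model name
// and API key are placeholders; method names follow the pre-1.0 langchain4j
// OpenAI module and may differ in newer versions.
import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.openai.OpenAiChatModel;

public class LocalOpenAiCompatibleSketch {
    public static void main(String[] args) {
        ChatLanguageModel model = OpenAiChatModel.builder()
                .baseUrl("http://localhost:11434/v1") // the local server does the GPU inference
                .apiKey("not-needed-for-local-servers")
                .modelName("llama3")
                .build();

        System.out.println(model.generate("Summarize what JabRef does in one sentence."));
    }
}
```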
