Regarding question 1 (the four-GPU deployment below): although everything appears to be loaded correctly, it still doesn't work and throws an error; there seems to be a timeout issue. What am I missing, please?
We have conducted initial tests on the Aphrodite-engine and are impressed with the results. We are now considering replacing vLLM with Aphrodite-engine for production. However, I have a few questions:
1. We plan to run it on RunPod using this template. To utilise four GPUs, is it sufficient to set NUM_GPUS to 4? We were planning to use turboderp/Llama-3-70B-Instruct-exl2 at 6.0bpw quantisation and were hoping to deploy it on 4 x 16 GB VRAM GPUs (64 GB in total). Is this configuration possible, and is exl2 supported across 4 GPUs? (See the VRAM estimate sketched after this list.)
2. How does Aphrodite-engine manage concurrent API requests: does it batch them and process them sequentially, or handle them in parallel? I read on Reddit that it supports concurrent processing, but reportedly at a 30%-40% quality reduction when asynchronous-concurrent generation is enabled. In our case, maintaining high response quality is critical. Is there an option to queue requests so they run sequentially, or with reduced concurrency, to avoid quality degradation? (A client-side queuing sketch follows this list.)
3. Is it possible to increase Llama-3's context length from 8192 to 9728 by setting CONTEXT_LENGTH to 9728? Aphrodite apparently supports automatic RoPE scaling. Would this adjustment negatively impact response quality? (The scaling arithmetic is sketched below.)
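
Whether question 1's 4 x 16 GB configuration can fit the model is largely arithmetic. A minimal back-of-envelope sketch, assuming a ~70B parameter count and ignoring the engine's own memory reservation and per-GPU fragmentation:

```python
# Back-of-envelope VRAM check for question 1. The parameter count and
# overhead figures are rough assumptions, not measured values.
params = 70e9                  # approx. parameter count of Llama-3-70B
bpw = 6.0                      # exl2 bits per weight
weights_gb = params * bpw / 8 / 1e9       # ~52.5 GB for the weights alone
total_vram_gb = 4 * 16                    # 4 x 16 GB GPUs
headroom_gb = total_vram_gb - weights_gb  # ~11.5 GB left for KV cache,
                                          # activations and CUDA overhead
print(f"weights ~= {weights_gb:.1f} GB, headroom ~= {headroom_gb:.1f} GB")
```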
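On question 2: whatever the server does internally, concurrency can also be capped client-side so responses are generated one at a time. A minimal sketch against an OpenAI-compatible endpoint; the port, API key, and model id here are placeholders, not values taken from the template:

```python
import asyncio
from openai import AsyncOpenAI

# Placeholder endpoint and credentials -- adjust to your deployment.
client = AsyncOpenAI(base_url="http://localhost:2242/v1", api_key="EMPTY")

# Semaphore(1) forces strictly sequential generation; raise the value
# to allow limited concurrency instead.
limiter = asyncio.Semaphore(1)

async def complete(prompt: str) -> str:
    async with limiter:  # requests queue here rather than running in parallel
        resp = await client.chat.completions.create(
            model="turboderp/Llama-3-70B-Instruct-exl2",  # placeholder model id
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

async def main() -> None:
    prompts = ["First question...", "Second question..."]
    for answer in await asyncio.gather(*(complete(p) for p in prompts)):
        print(answer)

asyncio.run(main())
```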
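On question 3: the RoPE scaling factor implied by the change is just the ratio of the two context lengths. Whether the template derives it automatically from CONTEXT_LENGTH is an assumption on my part:

```python
# Scaling factor implied by extending Llama-3's native context window.
native_ctx = 8192   # Llama-3's trained context length
target_ctx = 9728   # desired context length
factor = target_ctx / native_ctx
print(factor)  # 1.1875 -- a modest stretch; small factors usually cost
               # little quality, but that's an expectation, not a guarantee
```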
Many thanks.