Failed to connect to vineyard via both IPC and RPC connection #696

Open
Jeffwan opened this issue Feb 18, 2025 · 3 comments
Labels: area/kv-cache, kind/bug (Something isn't working), priority/critical-urgent (Highest priority. Must be actively worked on as someone's top priority right now.)

Comments

@Jeffwan
Collaborator

Jeffwan commented Feb 18, 2025

🐛 Describe the bug


INFO 02-17 17:03:44 model_runner.py:1041] Loading model weights took 12.5708 GB
INFO 02-17 17:03:44 vineyard_llm_cache.py:296] VineyardLLMCache async update: {'enable_async_update': True, 'min_inflight_tasks': 1, 'max_inflight_tasks': 8}
INFO 02-17 17:03:44 vineyard_llm_cache.py:306] VineyardLLMCache from_envs None
No RDMA endpoint provided. Fall back to TCP.
[info] Connection to RPC socket failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600 with ret = IOError: getaddrinfo() failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600, retrying 10 more times.
[info] Connection to RPC socket failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600 with ret = IOError: getaddrinfo() failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600, retrying 9 more times.
[info] Connection to RPC socket failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600 with ret = IOError: getaddrinfo() failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600, retrying 8 more times.
[info] Connection to RPC socket failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600 with ret = IOError: getaddrinfo() failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600, retrying 7 more times.
[info] Connection to RPC socket failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600 with ret = IOError: getaddrinfo() failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600, retrying 6 more times.
[info] Connection to RPC socket failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600 with ret = IOError: getaddrinfo() failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600, retrying 5 more times.
[info] Connection to RPC socket failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600 with ret = IOError: getaddrinfo() failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600, retrying 4 more times.
[info] Connection to RPC socket failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600 with ret = IOError: getaddrinfo() failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600, retrying 3 more times.
[info] Connection to RPC socket failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600 with ret = IOError: getaddrinfo() failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600, retrying 2 more times.
[info] Connection to RPC socket failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600 with ret = IOError: getaddrinfo() failed for endpoint aibrix-model-deepseek-coder-7b-kvcache-rpc:9600, retrying 1 more times.
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 236, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 34, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 735, in from_engine_args
    engine = cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 615, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 835, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 262, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 326, in __init__
    self.model_executor = executor_class(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 41, in _init_executor
    self.driver_worker.load_model()
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 184, in load_model
    self.model_runner.load_model()
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1104, in load_model
    self._init_vineyard_cache(self.cache_service_metrics)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1010, in _init_vineyard_cache
    self.vineyard_llm_cache: VineyardLLMCache = VineyardLLMCache.from_envs(
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/vineyard_llm_cache.py", line 307, in from_envs
    return VineyardLLMCache(
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/vineyard_llm_cache.py", line 136, in __init__
    self.cache = VineyardKVCache(
  File "/usr/local/lib/python3.10/dist-packages/vineyard/llm/cache.py", line 380, in __init__
    cache_config = AIBrixCacheConfig(**config)
  File "/usr/local/lib/python3.10/dist-packages/vineyard/llm/cache.py", line 257, in __init__
    self.rpc_client = vineyard.connect(
  File "/usr/local/lib/python3.10/dist-packages/vineyard/__init__.py", line 418, in connect
    return Client(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vineyard/core/client.py", line 296, in __init__
    raise ConnectionError(
ConnectionError: Failed to connect to vineyard via both IPC and RPC connection. Arguments, environment variables `VINEYARD_IPC_SOCKET` and `VINEYARD_RPC_ENDPOINT`, as well as the configuration file, are all unavailable.
ERROR 02-17 17:03:58 api_server.py:188] RPCServer process died before responding to readiness probe
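
The repeated `getaddrinfo() failed` lines suggest that the hostname in the RPC endpoint never resolved, so the client likely never got as far as opening a TCP connection. Below is a minimal sketch of a pre-flight check that could be run inside the pod. It uses only Python's standard `socket` module. The helper name `endpoint_resolves` and the injectable `resolver` parameter are illustrative assumptions, not part of vineyard or AIBrix.

```python
import socket


def endpoint_resolves(endpoint, resolver=socket.getaddrinfo):
    """Return True if the host part of a "host:port" endpoint resolves.

    `resolver` defaults to socket.getaddrinfo. It is injectable only so the
    check can be exercised without a real DNS server (an assumption of this
    sketch, not something the vineyard client exposes).
    """
    host, _, port = endpoint.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected 'host:port', got {endpoint!r}")
    try:
        resolver(host, int(port), type=socket.SOCK_STREAM)
    except socket.gaierror:
        # Same class of failure the log reports as "getaddrinfo() failed".
        return False
    return True
```

Calling `endpoint_resolves("aibrix-model-deepseek-coder-7b-kvcache-rpc:9600")` from the inference pod would show whether the Service name from the log is visible to cluster DNS at all.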

Steps to Reproduce

kubectl apply -f samples/kvcache/deployment.yaml
kubectl apply -f samples/kvcache/kvcache.yaml

Expected behavior

The inference engine should launch successfully.

Environment

  • nightly version

Jeffwan added the kind/bug, area/kv-cache, and priority/critical-urgent labels on Feb 18, 2025
@Jeffwan
Collaborator Author

Jeffwan commented Feb 18, 2025

I used the wrong endpoint here, but even with the wrong one, shouldn't the IPC connection help?

            - name: AIBRIX_LLM_KV_CACHE_RPC_ENDPOINT
              value: "aibrix-mode-deepseek-coder-7b-kvcache-rpc:9600"

It should be deepseek-coder-7b-kvcache-rpc:9600
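
One way to surface a malformed endpoint earlier is to validate the environment variable before handing it to the cache client. The sketch below reads `AIBRIX_LLM_KV_CACHE_RPC_ENDPOINT` (the variable name from the snippet above) and splits it into host and port. The function name `read_rpc_endpoint` and its error messages are hypothetical, and this is not necessarily how the AIBrix integration parses the value.

```python
import os

ENV_NAME = "AIBRIX_LLM_KV_CACHE_RPC_ENDPOINT"


def read_rpc_endpoint(environ=os.environ):
    """Return (host, port) parsed from the endpoint environment variable.

    Raises ValueError with the variable name in the message when the value
    is missing or is not of the form "host:port".
    """
    raw = environ.get(ENV_NAME, "").strip()
    if not raw:
        raise ValueError(f"{ENV_NAME} is not set")
    host, sep, port = raw.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"{ENV_NAME} must look like 'host:port', got {raw!r}")
    return host, int(port)
```

A format check like this only rules out malformed values. It would not have caught the wrong Service name in this issue, so a name-resolution check is still needed.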

@DwyaneShi
Collaborator

I used the wrong endpoint here, but even with the wrong one, shouldn't the IPC connection help?

            - name: AIBRIX_LLM_KV_CACHE_RPC_ENDPOINT
              value: "aibrix-mode-deepseek-coder-7b-kvcache-rpc:9600"

It should be deepseek-coder-7b-kvcache-rpc:9600

RPC is a must in our current implementation.

@Jeffwan
Collaborator Author

Jeffwan commented Feb 18, 2025

Hmm, the log is kind of misleading. It complains "Failed to connect to vineyard via both IPC and RPC connection". Technically, shouldn't it be able to connect to the cache via IPC? Is it possible that the IPC connection fails and only RPC connects, so that all following requests are sent via RPC? Do we have monitoring or logs to verify the data path?
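
As a sketch of what a less ambiguous error could look like: try each transport in turn, remember why each one failed, and report the reasons separately. Everything here is hypothetical (`connect_any`, the `connectors` mapping, and the idea of passing connector callables). It does not call vineyard and does not reflect how `vineyard.core.client.Client` is implemented.

```python
def connect_any(connectors):
    """Try named connector callables in order.

    Returns (name, connection) for the first connector that succeeds, so the
    caller can log which transport is actually in use. If every connector
    fails, raises a ConnectionError that lists the reason per transport.
    """
    failures = {}
    for name, connect in connectors.items():
        try:
            return name, connect()
        except (OSError, ValueError) as exc:
            # Keep the per-transport reason instead of collapsing it.
            failures[name] = f"{type(exc).__name__}: {exc}"
    details = "; ".join(f"{name} failed ({why})" for name, why in failures.items())
    raise ConnectionError(f"no transport available: {details}")
```

Logging the returned name once at startup would be one way to answer which data path is in use, and the combined error message would show that, for example, IPC was never configured while RPC failed on name resolution.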
