No available nodes obtained: what is the cause? #10

Open
lanwood opened this issue Dec 26, 2023 · 5 comments


lanwood commented Dec 26, 2023

BASE_IMAGE="registry.cn-hangzhou.aliyuncs.com/hfai/hai-platform:latest"

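hai-cli nodes does not list any nodes. I found the following error in the scheduler log (the message means "no available nodes obtained, please check"):

scheduler_0.log:133:2023-12-26 21:25:41.014 | ERROR | ticks_beater#1703597141000 | 没有拿到可用节点,请检查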


lanwood commented Dec 28, 2023

Loading a custom image with hai-cli images load fails with an error:

AttributeError: 'UserImage' object has no attribute 'async_load'


lanwood commented Dec 28, 2023

hai-cli nodes reports an error:

scheduler_0.log:133:2023-12-26 21:25:41.014 | ERROR | ticks_beater#1703597141000 | 没有拿到可用节点,请检查

In core.toml, launcher.task_namespace needs to be changed to 'hai-platform', and manager_nodes also needs to be adjusted.
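
One way to confirm that the change actually landed in the file being read is to load it and look up the key. The sketch below is only an illustration: it assumes launcher.task_namespace is a task_namespace key inside a [launcher] table, that the file is plain TOML, and that Python 3.11+ is available for tomllib. The helper names are made up here and are not part of hai-platform. manager_nodes is left out because its position in the file is not shown in this thread.

```python
from typing import Any


def get_dotted(config: dict, dotted_key: str, default: Any = None) -> Any:
    """Look up a dotted key such as 'launcher.task_namespace' in nested dicts."""
    current: Any = config
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


def namespace_matches(config: dict, expected: str = "hai-platform") -> bool:
    """Return True if launcher.task_namespace equals the expected namespace."""
    return get_dotted(config, "launcher.task_namespace") == expected


def load_toml(path: str) -> dict:
    """Parse a TOML file (hypothetical helper; needs Python 3.11+ for tomllib)."""
    import tomllib  # imported lazily so the helpers above work on older Pythons

    with open(path, "rb") as handle:  # tomllib requires a binary file object
        return tomllib.load(handle)


# Stand-in for a parsed core.toml (assumed layout); the real file has more keys.
example_config = {"launcher": {"task_namespace": "hai-platform"}}
```

Calling namespace_matches(load_toml(path)) on the core.toml inside the running container, rather than on the copy in the source tree, would show whether the rebuilt image really carries the new value.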


lanwood commented Dec 29, 2023

My custom training image passes validate_image.sh but cannot be used.

Volumes:
  073c58e81e0042f88595e717c4ebf61d:
    Type:          HostPath (bare host directory volume)
    Path:          /data/mnt/hai-platform/workspace/haiadmin
    HostPathType:  Directory
  marsv2-scripts-4:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      marsv2-scripts-4
    Optional:  false
  marsv2-entrypoints-4:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      marsv2-entrypoints-4
    Optional:  false
  start-scripts:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      start-scripts-4-0
    Optional:  false
  log-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /data/mnt/hai-platform/workspace/log/haiadmin
    HostPathType:  DirectoryOrCreate
  shm:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/hostname=k8s-worker-node1
Tolerations:     node.kubernetes.io/memory-pressure:NoExecute op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason       Age              From               Message
  ----     ------       ----             ----               -------
  Normal   Scheduled    5s               default-scheduler  Successfully assigned hai-platform/haiadmin-4-0 to k8s-worker-node1
  Normal   Pulled       4s               kubelet            Container image "python:3.10-slim" already present on machine
  Normal   Created      4s               kubelet            Created container hf-experiment
  Normal   Started      4s               kubelet            Started container hf-experiment
  Warning  FailedMount  1s (x3 over 3s)  kubelet            MountVolume.SetUp failed for volume "marsv2-entrypoints-4" : object "hai-platform"/"marsv2-entrypoints-4" not registered
  Warning  FailedMount  1s (x3 over 3s)  kubelet            MountVolume.SetUp failed for volume "marsv2-scripts-4" : object "hai-platform"/"marsv2-scripts-4" not registered
  Warning  FailedMount  1s (x3 over 3s)  kubelet            MountVolume.SetUp failed for volume "start-scripts" : object "hai-platform"/"start-scripts-4-0" not registered
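
To make the failing mounts easier to follow up on, the names of the objects the kubelet complains about can be pulled out of the event text. This is a rough sketch, not a diagnosis: it assumes the warning lines keep the exact wording shown above, and the function and variable names are invented for illustration.

```python
import re

# Matches kubelet messages of the form: object "<namespace>"/"<name>" not registered
NOT_REGISTERED = re.compile(r'object "([^"]+)"/"([^"]+)" not registered')


def unregistered_objects(describe_output: str) -> list[tuple[str, str]]:
    """Return unique (namespace, name) pairs from FailedMount warnings, in order."""
    found: list[tuple[str, str]] = []
    for line in describe_output.splitlines():
        if "FailedMount" not in line:
            continue  # only the mount warnings are of interest here
        match = NOT_REGISTERED.search(line)
        if match and match.groups() not in found:
            found.append(match.groups())
    return found


# The three warning lines from the events above, used as sample input.
sample_events = '''
Warning FailedMount 1s (x3 over 3s) kubelet MountVolume.SetUp failed for volume "marsv2-entrypoints-4" : object "hai-platform"/"marsv2-entrypoints-4" not registered
Warning FailedMount 1s (x3 over 3s) kubelet MountVolume.SetUp failed for volume "marsv2-scripts-4" : object "hai-platform"/"marsv2-scripts-4" not registered
Warning FailedMount 1s (x3 over 3s) kubelet MountVolume.SetUp failed for volume "start-scripts" : object "hai-platform"/"start-scripts-4-0" not registered
'''

for namespace, name in unregistered_objects(sample_events):
    print(f"{namespace}/{name}")
```

Each pair can then be checked by hand with kubectl get configmap NAME -n NAMESPACE to see whether the ConfigMap exists in that namespace.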


lanwood commented Dec 29, 2023

When creating a task, how can I configure it to use a training image that already exists locally instead of pulling a new one?

@yolunghiu

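hai-cli nodes reports an error:

scheduler_0.log:133:2023-12-26 21:25:41.014 | ERROR | ticks_beater#1703597141000 | 没有拿到可用节点,请检查

In core.toml, launcher.task_namespace needs to be changed to 'hai-platform', and manager_nodes also needs to be adjusted.

Did you manage to solve this problem? It looks like the settings in override.toml are not taking effect. I edited core.toml and rebuilt the image, but it still cannot get any nodes. scheduler_0.log reports: 2024-04-12 16:54:36.031 | ERROR | ticks_beater#1712912076000 | 没有拿到可用节点,请检查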
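
For comparing scheduler logs before and after a config change, the ERROR entries can be filtered out with a few lines of Python. This is a minimal sketch that assumes every line in scheduler_0.log follows the "timestamp | LEVEL | source | message" layout of the line quoted above; only that one line is known from this thread, and the names below are illustrative.

```python
from typing import Iterable, NamedTuple, Optional


class LogRecord(NamedTuple):
    timestamp: str
    level: str
    source: str
    message: str


def parse_line(line: str) -> Optional[LogRecord]:
    """Split one pipe-delimited log line; return None if it has another shape."""
    parts = [part.strip() for part in line.split("|", 3)]
    if len(parts) != 4:
        return None
    return LogRecord(*parts)


def error_records(lines: Iterable[str]) -> list[LogRecord]:
    """Keep only the records whose level is ERROR."""
    records = (parse_line(line) for line in lines)
    return [rec for rec in records if rec is not None and rec.level == "ERROR"]


# The line quoted in this comment, used as sample input.
sample_line = (
    "2024-04-12 16:54:36.031 | ERROR | ticks_beater#1712912076000 | 没有拿到可用节点,请检查"
)

for record in error_records([sample_line]):
    print(record.timestamp, record.source, record.message)
```

Passing an open file object for scheduler_0.log to error_records would list every ERROR entry, which makes it easier to see whether the error stops after a restart with the new config.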
